This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Multisite medical data sharing is critical in modern clinical practice and medical research. The challenge is to conduct data sharing that preserves individual privacy and data utility. The shortcomings of traditional privacy-enhancing technologies mean that institutions rely upon bespoke data-sharing contracts. The lengthy negotiation and administration these contracts induce makes data sharing inefficient and may disincentivize important clinical treatment and medical research. This paper provides a synthesis of 2 novel advanced privacy-enhancing technologies: homomorphic encryption and secure multiparty computation (referred to together as multiparty homomorphic encryption). These privacy-enhancing technologies provide a mathematical guarantee of privacy, with multiparty homomorphic encryption providing a performance advantage over using homomorphic encryption or secure multiparty computation separately. We argue that multiparty homomorphic encryption fulfills legal requirements for medical data sharing under the European Union’s General Data Protection Regulation, which has set a global benchmark for data protection. Specifically, data processed and shared using multiparty homomorphic encryption can be considered anonymized data. We explain how multiparty homomorphic encryption can reduce the reliance upon customized contractual measures between institutions. The proposed approach can accelerate the pace of medical research while offering additional incentives for health care and research institutions to employ common data interoperability standards.
The current biomedical research paradigm has been characterized by a shift from intrainstitutional research toward multiple collaborating institutions operating at an interinstitutional, national, or international level for multisite research projects; however, despite the apparent breakdown of research barriers, differences between ethical and legal requirements remain at all jurisdictional levels [
For example, the International Cancer Genome Consortium endeavors to amass cancer genomes paired with noncancerous sequences in a cloud environment, known as the Pancancer Analysis of Whole Genomes project. The International Cancer Genome Consortium’s data access compliance office was unable to establish an international cloud under the Pancancer Analysis of Whole Genomes project because of conflicts between United States and European Union data privacy laws [
In this paper, we describe how traditional data-sharing approaches relying upon conventional privacy-enhancing technologies are limited by various regulations governing medical use and data sharing. We describe two novel privacy-enhancing technologies, homomorphic encryption and secure multiparty computation, that extend the capacity of researchers to conduct privacy-preserving multisite research. We then turn to analyze the effects of regulation on using these novel privacy-enhancing technologies for medical and research data sharing. In particular, we argue these privacy-enhancing technologies guarantee anonymity as defined under the EU GDPR and are, therefore, key enablers for medical data sharing. We focus on the GDPR, as it currently represents a global benchmark in data protection regulations. We argue that using these technologies can reduce the reliance upon customized data-sharing contracts. The use of standardized agreements for multiparty processing of data in concert with privacy-enhancing technologies can reduce the bottleneck on research. Finally, we turn to address how these novel privacy-enhancing technologies can be integrated within existing regulatory frameworks to encourage increased data sharing while preserving data privacy.
Before examining novel privacy-enhancing technologies, it is necessary to examine the main models for exchanging medical data for research purposes and the limitations of conventional privacy protection mechanisms that are currently used to reduce the risk of reidentification. We synthesize the data-sharing models into three categories and analyze their main technological issues (
Overview of the three main data-sharing models: (A) centralized, (B) decentralized (site-level meta-analysis), and (C) decentralized (federated learning).
The centralized model requires medical sites (ie, data providers) that are willing to share data with each other to pool their individual-level patient data into a single repository. The data repository is usually hosted by one medical site or by an external third party (eg, a cloud provider) playing the role of trusted dealer. The main advantage of this model is that the trusted dealer enables authorized investigators to access all the patient-level information needed for data cleaning and for conducting statistical analysis. Moreover, such a data-sharing model minimizes infrastructure costs at medical sites, as data storage and computation are outsourced. However, from a data privacy perspective, the centralized model is often difficult to realize, especially when medical and genetic data must be exchanged across different jurisdictions. The central site hosting the data repository represents a single point of failure in the data-sharing process. All participating sites must trust this single entity to protect their patient-level data [
To minimize sensitive information leakage from data breaches, traditional anonymization techniques include suppressing directly identifying attributes, as well as generalizing, aggregating, or randomizing quasi-identifying attributes in individual patient records. In particular, the
However, given the increased sophistication of reidentification attacks [
As opposed to the centralized data-sharing model, the decentralized model does not require patient-level data to be physically transferred out of the medical sites’ information technology infrastructure. Medical sites keep control over their individual-level patient data and define their own data governance principles. For each clinical study, the statistical analysis is first computed on local data sets. The resulting local statistics are then sent to the site responsible for the final meta-analysis that aggregates the separate contribution of each data provider [
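The final aggregation step can be illustrated with a fixed-effect, inverse-variance meta-analysis; this is a generic sketch, not the method of any specific project discussed here, and the site estimates below are hypothetical.

```python
def inverse_variance_meta(estimates, std_errors):
    """Pool per-site effect estimates with inverse-variance weights.

    Each site shares only its local estimate and standard error
    (aggregate-level data); no patient-level record leaves the site.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * b for w, b in zip(weights, estimates)) / sum(weights)
    pooled_se = (1.0 / sum(weights)) ** 0.5
    return pooled, pooled_se

# Hypothetical local results from three sites: effect estimate, standard error
site_estimates = [0.42, 0.35, 0.50]
site_std_errors = [0.10, 0.08, 0.15]
pooled, pooled_se = inverse_variance_meta(site_estimates, site_std_errors)
```

Sites with smaller standard errors (more precise local analyses) receive proportionally larger weights in the pooled estimate.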
However, the sharing of only aggregate-level data does not guarantee patients’ privacy by itself. Some aggregate-level statistics may be computed on subpopulations that are too small (such as patients with rare diseases) and can therefore be considered personally identifying. Moreover, in some circumstances, aggregate-level data from local analyses can be exploited to detect the presence of target individuals in the original data set. For example, an attacker may already hold the individual-level data of 1 or several target individuals [
To address these inference attacks, clinical sites can anonymize their local statistics by applying obfuscation techniques that mainly consist of adding a certain amount of statistical noise to the aggregate-level data before transfer to third parties. This process enables data providers to achieve formal notions of privacy such as differential privacy [
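One common obfuscation step of this kind is the Laplace mechanism, sketched below for a counting query; this is a generic illustration, and the cohort size and privacy budget (epsilon) are hypothetical.

```python
import math
import random

def laplace_noise(scale, rng=random):
    # Inverse-CDF sampling from the Laplace(0, scale) distribution
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count, epsilon, rng=random):
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing a single
    patient changes the result by at most 1), so Laplace noise with
    scale 1/epsilon suffices.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Example: a site releases a noisy cohort size instead of the exact count
noisy = dp_count(true_count=37, epsilon=1.0)
```

Smaller values of epsilon add more noise and therefore give stronger privacy at the cost of data utility, which is precisely the trade-off discussed above.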
Beyond privacy considerations, this approach also suffers from a lack of flexibility, as the medical sites involved must agree on the choice of parameters and covariates before the analysis. This coordination often depends on manual approval, impeding the pace of the analysis itself. Finally, as opposed to the centralized approach, the accuracy of results from a meta-analysis that combines the summary statistics or results of local analyses can be affected by cross-study heterogeneity. This can lead to inaccurate and misleading conclusions [
The federated model is an evolution of the decentralized model based on site-level meta-analysis. Instead of sharing the results of local analyses, the participating data providers collaborate to perform a joint analysis or the training of a machine learning model in an interactive and iterative manner, only sharing updates of the model’s parameters. One of the medical sites participating in the multicentric research project (typically the site responsible for the statistical analysis) becomes the reference site (or central site) and defines the model to be trained (or analysis to be performed) and executed on the data distributed across the network. This model is referred to as the global model. Each participating site is given a copy of the model to train on their own individual-level data. Once the model has been trained locally over several iterations, the sites send only their updated version of the model parameters (aggregate-level information) to the central site and keep their individual-level data at their premises. The central site aggregates the contributions from all the sites and updates the global model [
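The iterative loop can be sketched as follows for a one-parameter linear model trained by gradient descent; this is a minimal illustration under simplified assumptions (the site data are hypothetical), and real federated analyses train far richer models and typically add secure aggregation of the updates.

```python
def local_gradient(w, data):
    # Gradient of the mean squared error of the model y = w * x
    # computed on one site's local records only
    return sum(2.0 * (w * x - y) * x for x, y in data) / len(data)

def federated_training(site_data, rounds=200, lr=0.05):
    """Train a shared slope w without moving patient-level records.

    Each round, every site computes a gradient on its local data and
    shares only that aggregate value (the model update); the central
    site averages the updates, weighted by site size, and updates the
    global model.
    """
    w = 0.0  # global model, initialized by the central site
    total = sum(len(d) for d in site_data)
    for _ in range(rounds):
        grad = sum(len(d) * local_gradient(w, d) for d in site_data) / total
        w -= lr * grad
    return w

# Hypothetical (x, y) records held by two sites; both follow y = 2x
site_a = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
site_b = [(1.5, 3.0), (2.5, 5.0)]
w = federated_training([site_a, site_b])
```

Because the size-weighted average of the local gradients equals the gradient over the pooled data, the global model converges as if it had been trained centrally, while individual-level records never leave their site.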
Compared with the distributed data-sharing approach based on site-level meta-analysis, the federated approach is more robust against heterogeneous distributions of the data across different sites, thus yielding accuracy comparable to that obtained when the same analysis is conducted under the centralized model. Moreover, this approach does not suffer from the loss in statistical power of conventional meta-analyses. Prominent projects that have attempted to employ federated approaches to the analysis and sharing of biomedical data are the DataSHIELD project [
The federated data-sharing approach combines the best features of the other two approaches. However, although the risk of reidentification is reduced compared with the centralized approach, the federated approach remains vulnerable to the same inference attacks as the meta-analysis approach. These inference attacks exploit the aggregate-level data released during collaboration [
Finally, regardless of the type of distributed data-sharing model, obfuscation techniques for anonymizing aggregate-level data are rarely used in practice in medical research because of their impact on data utility. As a result, these technical privacy limitations are usually addressed via additional legal and organizational mechanisms. For the DataSHIELD project, access is limited to organizations that have consented to the terms of use for DataSHIELD and have sought appropriate ethics approval to participate in a DataSHIELD analysis [
In the last few years, several cryptographic privacy-enhancing technologies have emerged as significant potential advances for addressing the above-mentioned data protection challenges that still affect medical data sharing in the decentralized model. Although hardware-based approaches could be envisioned for this purpose, they are usually tailored to centralized scenarios and introduce a different trust model involving the hardware provider. Furthermore, they depend on assumptions about the security of the hardware platform, for which new vulnerabilities are constantly being discovered. In this paper, we focus on two of the most powerful software-based privacy-enhancing technologies: homomorphic encryption and secure multiparty computation. Both rely upon mathematically proven guarantees for data confidentiality, grounded respectively on cryptographically hard problems and noncollusion assumptions.
Homomorphic encryption [
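To make the additively homomorphic property concrete, the following toy sketch implements a Paillier-style scheme with artificially small primes; it is insecure and for exposition only, and is not the scheme used by any system discussed here.

```python
import math
import random

def paillier_keygen(p=293, q=433):
    # Toy primes; a real deployment uses primes of at least 1024 bits each
    n = p * q
    g = n + 1
    lam = math.lcm(p - 1, q - 1)
    # mu = (L(g^lam mod n^2))^(-1) mod n, where L(x) = (x - 1) // n
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu)

def encrypt(pk, m):
    n, g = pk
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:  # r must be a unit modulo n
        r = random.randrange(1, n)
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n

pk, sk = paillier_keygen()
c1, c2 = encrypt(pk, 42), encrypt(pk, 58)
# Multiplying ciphertexts adds the underlying plaintexts
c_sum = (c1 * c2) % (pk[0] ** 2)
```

A third party (eg, a cloud provider) can compute `c_sum` without ever seeing 42, 58, or their sum; only the holder of the decryption key learns the result.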
Secure multiparty computation [
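One common building block of secure multiparty computation, additive secret sharing, can be sketched as follows; the hospital counts are hypothetical, and the sketch omits the malicious-security machinery of real protocols.

```python
import random

PRIME = 2**61 - 1  # all arithmetic is modulo a public prime

def share(secret, n_parties, rng=random):
    """Split a value into n additive shares; any n-1 shares reveal nothing."""
    shares = [rng.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

# Three hospitals (hypothetical) each secret-share a local patient count
counts = [120, 340, 95]
all_shares = [share(c, 3) for c in counts]

# Computing party i locally adds up the i-th share of every input ...
partial_sums = [sum(s[i] for s in all_shares) % PRIME for i in range(3)]
# ... and only these partial sums are combined to reveal the total
total = reconstruct(partial_sums)
```

No party ever sees another hospital's count: each share in isolation is a uniformly random value, and the total is revealed only when the partial sums are combined, which reflects the noncollusion assumption underpinning these protocols.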
The combination of secure multiparty computation and homomorphic encryption was proposed to overcome their respective overheads and technical limitations; we refer to it as multiparty homomorphic encryption [
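The core idea of multiparty homomorphic encryption (a collective encryption key whose decryption requires every party's key share) can be sketched with a toy exponent-ElGamal construction; the parameters are insecure and for illustration only, and this is not the construction of any specific system cited here.

```python
import random

# Toy group parameters (insecure; for exposition only)
P = 2**61 - 1  # public prime modulus
G = 3          # public base

def keygen(rng=random):
    s = rng.randrange(2, P - 1)
    return s, pow(G, s, P)  # (secret key share, public contribution)

def encrypt(collective_pk, m, rng=random):
    # Exponent ElGamal: additively homomorphic for small messages
    r = rng.randrange(2, P - 1)
    return pow(G, r, P), pow(G, m, P) * pow(collective_pk, r, P) % P

def partial_decrypt(c1, s_i):
    # Each party contributes c1^(s_i); no single party can finish alone
    return pow(c1, s_i, P)

def combine(c2, partials, max_m=100_000):
    blind = 1
    for d in partials:
        blind = blind * d % P
    gm = c2 * pow(blind, -1, P) % P
    gx = 1
    for m in range(max_m + 1):  # brute-force the small exponent
        if gx == gm:
            return m
        gx = gx * G % P
    raise ValueError("message out of range")

# Three institutions generate key shares; the collective public key is
# the product of the individual public contributions
shares, pubs = zip(*(keygen() for _ in range(3)))
collective_pk = 1
for pk in pubs:
    collective_pk = collective_pk * pk % P

# Homomorphic addition: componentwise multiplication of ciphertexts
a1, a2 = encrypt(collective_pk, 120)
b1, b2 = encrypt(collective_pk, 75)
c1, c2 = a1 * b1 % P, a2 * b2 % P

# Decryption succeeds only with every party's partial decryption
total = combine(c2, [partial_decrypt(c1, s) for s in shares])
```

Anyone can encrypt under the collective key and compute on the ciphertexts, but decryption requires a partial decryption from every key-share holder, which is the property that removes the single point of failure of a centrally held key.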
Unlike homomorphic encryption or secure multiparty computation alone, multiparty homomorphic encryption provides effective, scalable, and practical solutions for addressing the privacy-preserving issues that affect the distributed or federated approach for data sharing. For example, systems such as Helen [
In this section, we focus on the features of EU data protection law concerning encryption and data sharing. We focus on the GDPR because of the persistence of national divergences in member state law, despite the passage of the GDPR. In particular, the GDPR provides that member states can introduce further conditions, including restrictions, on the processing of genetic data, biometric data, or health-related data. These exceptions exist outside the narrow circumstances in which special categories of personal data, to which genetic data, biometric data, and health-related data belong, can be processed [
The GDPR defines
Spindler and Schmechel [
At the supranational level, the former Article 29 Working Party (now the European Data Protection Board) has favored a relative over an absolute approach to anonymization. First, the Article 29 Working Party held that the words “means reasonably likely” suggest that a theoretical possibility of reidentification will not be enough to render those data personal data [
The GDPR’s provisions apply to data controllers, that is, entities determining the purpose and means of processing personal data. This definition encompasses both health care institutions and research institutions. Data controllers must guarantee that personal data processing is lawful, proportionate, and protects the rights of data subjects. In particular, the GDPR provides that encryption should be used as a safeguard when personal data are processed for a purpose other than that for which they were collected. Although the GDPR does not define encryption, the Article 29 Working Party treats encryption as equivalent to stripping identifiers from personal data. The GDPR also lists encryption as a strategy that can guarantee personal data security. Furthermore, the GDPR emphasizes that data controllers should consider the state of the art, along with the risks associated with processing, when adopting security measures. The GDPR also provides that data processing for scientific purposes should follow the principle of data minimization. This principle requires data processors and controllers to use nonpersonal data unless the research can only be completed with personal data. If personal data are required to complete the research, pseudonymized or aggregate data should be used instead of directly identifying data.
The GDPR imposes obligations on data controllers with respect to the transfer of data, particularly outside of the European Union. Specifically, the GDPR requires the recipient jurisdiction to offer adequate privacy protection before a data controller transfers data there. Otherwise, the data controller must ensure there are organizational safeguards in place to ensure the data receives GDPR-equivalent protection. Furthermore, data controllers must consider the consequences of exchanging data between institutions, and whether these are joint controllership or controller–processor arrangements. Under the GDPR, data subject rights can be exercised against any and each controller in a joint controllership agreement. Furthermore, controllers must have in place an agreement setting out the terms of processing. By contrast, a data controller-processor relationship exists where a controller directs a data processor to perform processing on behalf of the controller, such as a cloud services provider. The GDPR provides that any processing contract must define the subject matter, duration, and purpose of processing. Contracts should also define the types of personal data processed and require processors to guarantee both the confidentiality and security of processing.
In this section, we argue that multiparty homomorphic encryption, or homomorphic encryption and secure multiparty computation used in concert, meets the requirements for anonymization of data under the GDPR. Furthermore, we argue the use of multiparty homomorphic encryption can significantly reduce the need for custom contracts to govern data sharing between institutions. We focus on genetic and clinical data sharing due to the potential for national derogations pertaining to the processing of health-related data. Nevertheless, our conclusions regarding the technical and legal requirements for data sharing using multiparty homomorphic encryption, or homomorphic encryption and secure multiparty computation, may apply to other sectors, depending on regulatory requirements [
Under the GDPR, separating pseudonymized data and identifiers is analogous to separating decryption keys and encrypted data. For pseudonymized data, any entity with physical or legal access to the identifiers will possess personal data [
Whether a party to data processing using advanced privacy-enhancing technologies has lawful access to data or decryption keys depends on the legal relationship between the parties. With respect to joint controllership, recent CJEU case law has established that parties can be joint controllers even without access to personal data [
Applying these principles to processing with privacy-enhancing technologies: for homomorphic encryption, there is no mathematical possibility of decrypting the data without the decryption key. This holds true both when the data are at rest and when the data are processed in the encrypted space via secure operations such as homomorphic addition or multiplication. Whether data processed as part of secure multiparty computation or multiparty homomorphic encryption remain personal data depends on whether entities have lawful access to the personal data or decryption keys, respectively. If entities can only access the personal data they physically hold as part of a joint controller agreement, the data fragments exchanged during secret sharing via secure multiparty computation are not personal data. Likewise, under multiparty homomorphic encryption, each individual entity only has access to a fragment of the decryption key, which can only be recombined with the approval of all other entities holding the remaining fragments. This argument is reinforced by Recital 57 of the GDPR [
Therefore, we submit that homomorphic encryption and secure multiparty computation, whether used alone or together through multiparty homomorphic encryption, can be used to jointly process health-related data while complying with the GDPR. These data remain anonymous even though entities processing data using multiparty homomorphic encryption are joint controllers. Furthermore, the use of advanced privacy-enhancing technologies should become a best standard for the processing of health-related data for three reasons. First, the Article 29 Working Party has recommended using encryption and anonymization techniques in concert to protect against orthogonal privacy risks and overcome the limits of individual techniques [
Therefore, we argue that multiparty homomorphic encryption involves processing anonymized data under EU data protection law. Although homomorphic encryption, secure multiparty computation, and multiparty homomorphic encryption do not obviate the need for a joint controllership agreement, they lessen the administrative burden required for data sharing. Furthermore, they promote the use of standard processing agreements that can help ameliorate the impact of national differences within and outside the European Union. Accordingly, we submit that multiparty homomorphic encryption, along with other forms of advanced privacy-enhancing technologies, should represent the standard for health data processing in low-trust environments [
Comparison of the status of personal data under a distributed approach relying upon traditional privacy-enhancing technologies (eg, aggregation and pseudonymization) and a distributed approach relying on multiparty homomorphic encryption (ie, combined homomorphic encryption and secure multiparty computation).
Data status at different stages of processing.
Scenario | Description | Status of data based on the scenario |
A | Hospital/research institution physically holds personal data | Personal data |
B | Hospital/research institution has legal access to decryption key/personal data | Pseudonymized data |
C | Hospitals/research institutions combine decryption keys/personal data to process data | Anonymized data |
D | Third party (cloud service provider) carries out processing, hospitals share encryption keys jointly | Anonymized data |
The lack of reliance upon custom contracts may encourage institutions to align their data formats to common international interoperability standards. In the next section, we turn to address the standardization of these advanced privacy-enhancing technologies.
At present, regulatory instruments provide limited guidance on the different types of privacy-enhancing technologies required to process medical data in a privacy-conscious fashion. However, the techniques described in this paper may represent a future best standard for processing medical data for clinical or research purposes. Because of the novelty of both technologies, the standardization of homomorphic encryption and secure multiparty computation is ongoing, with the first community standard released in 2018 [
Furthermore, there are numerous documents published by data protection agencies that can aid the development of such guidelines. For example, the
Nevertheless, any standards will need to be continually updated to respond to new technological changes. For example, one of the most significant drawbacks of fully homomorphic encryption is the complexity of computation. This computational complexity makes it hard to predict running times, particularly for low-power devices such as wearables and smartphones. For the foreseeable future, this may limit the devices upon which fully homomorphic encryption can be used [
A final consideration relates to ethical issues that exist beyond whether homomorphic encryption, secure multiparty computation, and multiparty homomorphic encryption involve processing anonymized or personal data. First, the act of encrypting personal data constitutes further processing of those data under data protection law. Therefore, health care and research institutions must seek informed consent from patients or research participants [
Medical data sharing is essential for modern clinical practice and medical research. However, traditional privacy-preserving technologies based on data perturbation, along with centralized and decentralized data-sharing models, carry inherent privacy risks and can have a high impact on data utility. These shortcomings mean that research and health care institutions combine these traditional privacy-preserving technologies with contractual mechanisms to govern data sharing and comply with data protection laws. These contractual mechanisms are context dependent and require trusted relationships between research and health care institutions. Although federated learning models can help alleviate these risks, as only aggregate-level data are shared across institutions, there are still orthogonal risks to privacy from indirect reidentification of patients from partial results [
Court of Justice of the European Union
coronavirus disease 2019
(European Union) General Data Protection Regulation
human immunodeficiency virus
We are indebted to Dan Bogdanov, Brad Malin, Sylvain Métille, and Pierre Hutter for their invaluable feedback on earlier versions of this manuscript. This work was partially funded by the Personalized Health and Related Technologies Program (grant 2017-201; project: Data Protection and Personalized Health) supported by the Council of the Swiss Federal Institutes of Technology.
None declared.