This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
Data sharing in multicenter medical research can improve the generalizability of research, accelerate progress, enhance collaborations among institutions, and lead to new discoveries from data pooled from multiple sources. Despite these benefits, many medical institutions are unwilling to share their data, as sharing may cause sensitive information to be leaked to researchers, other institutions, and unauthorized users. Great progress has been made in the development of secure machine learning frameworks based on homomorphic encryption in recent years; however, nearly all such frameworks use a single secret key and lack a description of how to securely evaluate the trained model, which makes them impractical for multicenter medical applications.
The aim of this study is to provide a privacy-preserving machine learning protocol (eg, for logistic regression) for multiple data providers and researchers. This protocol allows researchers to train models and then evaluate them on medical data from multiple sources while providing privacy protection for both the sensitive data and the learned model.
We adapted a novel threshold homomorphic encryption scheme to guarantee privacy requirements. We devised new relinearization key generation techniques for greater scalability and multiplicative depth and new model training strategies for simultaneously training multiple models through x-fold cross-validation.
Using a client-server architecture, we evaluated the performance of our protocol. The experimental results demonstrated that, with 10-fold cross-validation, our privacy-preserving logistic regression model training and evaluation over 10 attributes in a data set of 49,152 samples took approximately 7 minutes and 20 minutes, respectively.
We present the first privacy-preserving multiparty logistic regression model training and evaluation protocol based on threshold homomorphic encryption. Our protocol is practical for real-world use and may promote multicenter medical research to some extent.
In recent years, expectations for the quality of medical research have risen as the field has progressed, which has promoted the development of multicenter research. Compared with single-center research, multicenter research has many significant advantages, including enabling specific analyses for which no single institution has sufficient data, such as on a rare disease; providing medical data from different locations with diverse demographics, which increases the reproducibility and generalizability of the research; and generating pooled medical data that enables new discoveries that cannot be elucidated from any individual data set [
However, data sharing during multicenter research may increase privacy security risks. As medical data are highly sensitive, the leakage of sensitive information will lead to severe consequences, such as financial loss, social discrimination, and unauthorized data abuse, which can harm both patients and medical institutions [
Logistic regression is a widely used machine learning approach in various medical applications, such as prognostic prediction, disease diagnosis, and decision-making support [
Multikey homomorphic encryption, first proposed by López-Alt et al [
In this study, we propose a privacy-preserving multicenter research protocol using secure logistic regression, consisting of 3 primary entities: researchers, a service provider, and data providers, in which medical data are horizontally distributed. Our proposed protocol supports not only model training but also the evaluation of the trained model in a secure manner. The protocol guarantees the privacy of both the sensitive data for the data providers and the trained model for the researchers during model training and trained model evaluation. To satisfy privacy requirements, we apply threshold homomorphic encryption and propose a new relinearization key generation process that increases scalability and multiplicative depth. The proposed protocol has been implemented and tested with simulated real-life scenarios. The experimental results demonstrate that our protocol is efficient and practical for real-world applications.
Our proposed protocol includes 3 primary entities as shown below. The architecture of the proposed protocol is shown in
The architecture of the proposed protocol, containing 3 entities: data providers, a service provider, and researchers.
Data providers: these include institutions (eg, hospitals) that hold medical data and are willing to provide these data to the service provider for public use so long as the privacy of the data is preserved. To share medical data, the data providers must obtain patient consent if local law so requires. Upon receiving the researchers’ requests from the service provider, the data providers can decide whether to accept or refuse. To allow researchers to obtain correct research data, all data providers must implement data standardization to transform the data into a common format, such as the Observational Medical Outcomes Partnership common data model from the Observational Health Data Sciences and Informatics collaborative [
Service provider: this refers to an entity that (1) provides storage for encrypted data and research information, (2) performs the most computationally expensive part of the privacy-preserving logistic regression, and (3) performs information transfer among the data providers, the service provider, and the researchers. In addition, an interactive website is deployed by the service provider for researchers to conduct their studies in a secure manner and for data providers to authorize certain research requests.
Researchers: these include the individuals or organizations who want to conduct research on multiple data providers’ data sets. Researchers submit their requests to the service provider, which forwards them to the data providers for further processing.
As we use threshold homomorphic encryption to guarantee data and model security, in our proposed protocol, one public key corresponds to multiple secret keys, and different secret keys are distributed to different data providers and researchers. Furthermore, we assume that there exists at least one honest party and that some semi-honest adversaries are capable of reading the internal information of the colluding parties while not deviating from the defined protocol [
Logistic regression is a classification algorithm that is widely used in medicine, including for disease diagnosis, clinical decision support, and risk assessment. Suppose a data set consists of pairs (
In the sigmoid function σ(
Homomorphic encryption is a special type of encryption scheme that allows computations on ciphertexts without the need to access a secret key. Once the result of the computation is decrypted, it matches the result of the operations as if they were performed on the plaintext.
In our proposed protocol, we perform secure multiparty logistic regression using Brakerski/Fan-Vercauteren (BFV), a somewhat homomorphic encryption scheme based on ring learning with errors (RLWE) that supports a limited number of additions and multiplications [
The details of the threshold variant of the BFV scheme are described as follows. The security and noise analysis of the scheme are provided in
setup(1^{λ}): takes the security parameter λ as an input and returns the public parameterization param, including the degree of polynomial modulus n, the coefficient modulus q, the plaintext modulus t, and the (key, error) distribution (D1, D2).
THE.keygenSP(param): the service provider samples a ← R_{q} and outputs it. Here, R_{q}=Z_{q}[x]/(x^{n}+1) is the ciphertext space of param.
THE.keygenSkpk(param, a): each party p_{i} samples s_{i} ← D1, e_{i} ← D2, sets s_{i} as its secret key, and outputs its public key pk_{i}=[−(a · s_{i}+e_{i})]_{q}. Let subscript *_{co} denote the combined key. The combined public key pk_{co} among parties p_{1},...,p_{z} is then computed as follows:
THE.keygenRelin(param, s_{1},...,s_{z}): the parties together with the service provider generate the combined relinearization key rlk_{co}. As the generation of the relinearization key is rather complicated, we will show the details of this step later.
THE.encrypt(m, pk_{co}): this takes a polynomial m∈R_{t} as the input, where R_{t} is the plaintext space of param. Let pk_{co}=(pk_{co}(0), pk_{co}(1)) and Δ=⌊q/t⌋, and sample u ← D1 and (e_{1}, e_{2}) ← D2, then return:
THE.eval(C, rlk_{co}, c_{1},...,c_{c}): given a circuit C, a tuple of ciphertexts encrypted by the same public key, and the corresponding relinearization key, this outputs a ciphertext c_{out}. The procedure for homomorphic addition and multiplication is the same as that in the original single-key BFV scheme.
THE.decrypt(c, s_{1},...,s_{z}): given the ciphertext c=(c(0), c(1)) encrypted by pk_{co} and the corresponding secret keys, sample (e_{1},...,e_{z}) ← D_{smg}. Here, the subscript *_{smg} means that the variance of the noise distribution is much larger than that of the input ciphertext noise distribution to guarantee circuit privacy through smudging techniques [
These shares are sent to the party that requires the unencrypted result. The decryption result
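The key generation, encryption, and shared decryption steps above can be illustrated with a toy sketch. It is not the paper's implementation: the ring degree, moduli, and noise bounds are tiny illustrative values with no security, all function names are hypothetical, and the combined public key is assumed to be the sum of the parties' public key shares paired with the common polynomial a, which is consistent with the definition of pk_{i} above but is not spelled out in this text. Messages are placed directly in the polynomial coefficients rather than CRT batched.

```python
import random

N = 16          # toy ring degree (the paper's setting is 16,384)
Q = 1 << 40     # toy coefficient modulus q
T = 257         # toy plaintext modulus t
DELTA = Q // T  # scaling factor floor(q / t)


def poly_add(a, b):
    """Add two polynomials coefficient-wise modulo Q."""
    return [(x + y) % Q for x, y in zip(a, b)]


def poly_mul(a, b):
    """Multiply in Z_Q[x]/(x^N + 1); x^N wraps around with a sign flip."""
    out = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < N:
                out[k] = (out[k] + ai * bj) % Q
            else:
                out[k - N] = (out[k - N] - ai * bj) % Q
    return out


def small_poly(rng, bound=1):
    """Sample a polynomial with coefficients in [-bound, bound]."""
    return [rng.randint(-bound, bound) for _ in range(N)]


def keygen_share(rng, a):
    """One party's secret key s_i and public key share -(a*s_i + e_i)."""
    s, e = small_poly(rng), small_poly(rng)
    pk = [(-c) % Q for c in poly_add(poly_mul(a, s), e)]
    return s, pk


def combine_public_keys(pk_shares, a):
    """Assumed combination rule: sum the shares and pair the sum with a."""
    pk0 = [0] * N
    for pk in pk_shares:
        pk0 = poly_add(pk0, pk)
    return pk0, a


def encrypt(rng, m, pk_co):
    """Textbook BFV encryption under the combined public key."""
    u, e1, e2 = small_poly(rng), small_poly(rng), small_poly(rng)
    c0 = poly_add(poly_add([DELTA * x for x in m], poly_mul(pk_co[0], u)), e1)
    c1 = poly_add(poly_mul(pk_co[1], u), e2)
    return c0, c1


def decryption_share(rng, c, s, smudge_bound=1 << 10):
    """Partial decryption c(1)*s_i plus large 'smudging' noise."""
    return poly_add(poly_mul(c[1], s), small_poly(rng, smudge_bound))


def combine_shares(c, shares):
    """Add all shares to c(0), then scale by t/q and round."""
    acc = c[0]
    for share in shares:
        acc = poly_add(acc, share)
    return [((T * x + Q // 2) // Q) % T for x in acc]


def demo(seed=0, parties=3):
    rng = random.Random(seed)
    a = [rng.randrange(Q) for _ in range(N)]  # common random polynomial
    keys = [keygen_share(rng, a) for _ in range(parties)]
    pk_co = combine_public_keys([pk for _, pk in keys], a)
    m1 = [rng.randrange(T) for _ in range(N)]
    m2 = [rng.randrange(T) for _ in range(N)]
    ct1, ct2 = encrypt(rng, m1, pk_co), encrypt(rng, m2, pk_co)
    # Homomorphic addition: add the ciphertexts component-wise.
    ct_sum = (poly_add(ct1[0], ct2[0]), poly_add(ct1[1], ct2[1]))
    shares = [decryption_share(rng, ct_sum, s) for s, _ in keys]
    return m1, m2, combine_shares(ct_sum, shares)
```

Calling `demo()` returns the two messages and the jointly decrypted sum, which should equal their coefficient-wise sum modulo T. No single party can decrypt alone in this sketch, because removing the mask from c(0) requires every party's share.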
The workflow of our proposed protocol consists of 5 major steps, as shown in
Workflow of the proposed protocol.
The service provider initializes the BFV homomorphic encryption parameters. These parameters should be carefully selected because they affect many aspects of the encryption scheme, such as operational performance, security level, multiplicative depth of the circuit, and space consumption. Two sets of parameters must be initialized by the service provider, one for the privacy-preserving logistic regression—
To make the encryption scheme practical, these parameters should meet the following criteria. First, the degree of polynomial modulus
Furthermore, to generate relinearization keys safely and correctly, the 2 sets of parameters must satisfy the following requirements: (1) their polynomial moduli must share the same degree and (2) the plaintext modulus in
The research application consists of several message transfers among the data providers, the service provider, and the researchers. First, a researcher visits the website deployed by the service provider and sets up a new research study. When the research begins, 3 settings must be confirmed by the researcher: first, the query condition used to obtain the research data; second, the list of data providers from which the researcher wishes to obtain the research data; and finally, the settings of the secure logistic regression, including the variables to be used as features, the variable to be used as the class label, the maximum number of iterations, the learning rate, and the termination condition of the model training. This information is stored in the database of the service provider and sent to the corresponding data providers as a research request. After receiving the request, the data providers decide whether to authorize this research and send their decision to the service provider, which informs the corresponding researcher about the authorization status.
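As an illustration only, the research request described above could be represented by a small record such as the following. The class name, field names, default values, and example identifiers are all hypothetical; the paper does not specify a message format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ResearchRequest:
    """Hypothetical container for the 3 settings a researcher confirms."""
    query_condition: str            # how the research data are selected
    data_providers: List[str]       # providers the researcher wants to include
    features: List[str]             # variables used as features
    label: str                      # variable used as the class label
    max_iterations: int = 100       # illustrative default
    learning_rate: float = 0.1      # illustrative default
    termination_threshold: float = 1e-3  # illustrative termination condition
    decisions: Dict[str, bool] = field(default_factory=dict)

    def record_decision(self, provider: str, accepted: bool) -> None:
        """Store one data provider's accept/refuse decision."""
        if provider not in self.data_providers:
            raise ValueError(f"{provider} was not asked to join this study")
        self.decisions[provider] = accepted

    def authorized_providers(self) -> List[str]:
        """Providers that have accepted so far, in the requested order."""
        return [p for p in self.data_providers if self.decisions.get(p) is True]

    def pending_providers(self) -> List[str]:
        """Providers that have not answered yet."""
        return [p for p in self.data_providers if p not in self.decisions]
```

A service provider could store such a record, forward it to each listed data provider, and report `authorized_providers()` and `pending_providers()` back to the researcher as the authorization status.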
Once the data providers complete the research authorization, key generation is implemented by an interactive protocol among all parties, which comprises 2 steps—THE.keygenSP and THE.keygenSkpk. After this procedure, each party
The data preparation phase then begins, which is described as follows:
The data provider generates its own research data according to the query condition of the research. Next, all the floating-point numbers in the research data are scaled and rounded into integers because all the operations in the BFV scheme are integer based. Categorical features are encoded as integers if they are Boolean or ordered; otherwise, one-hot encoding is implemented.
The data provider encodes the research data by CRT batching. As mentioned before, we can pack multiple values into one polynomial and apply operations to them in an SIMD manner via CRT batching. This means that when given a data set with d features and N samples, one can pack them into d+1 polynomials (d features and 1 class label) as long as the degrees of the polynomial moduli are larger than N.
The data provider encrypts all the CRT-batched polynomials using the combined public key pk1_{co}. After all the plaintext polynomials are encrypted, they are sent to the service provider.
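The first two data preparation steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the scaling factor of 100, the function names, and the zero padding of unused slots are choices made here, and the slot vectors stand in for CRT-batched plaintexts (an actual implementation would hand each vector to the encryption library's batch encoder before encryption).

```python
from typing import List, Sequence


def to_fixed_point(value: float, scale: int = 100) -> int:
    """Scale a floating-point value and round it to an integer."""
    return int(round(value * scale))


def one_hot(value: str, categories: Sequence[str]) -> List[int]:
    """Encode an unordered categorical value as 0/1 indicators."""
    if value not in categories:
        raise ValueError(f"unknown category: {value}")
    return [1 if value == c else 0 for c in categories]


def pack_columns(samples: Sequence[Sequence[int]], slots: int) -> List[List[int]]:
    """Turn N integer rows into one slot vector per column.

    Each returned vector holds one column (a feature or the class label)
    for all samples, so a single slot-wise operation acts on every sample.
    """
    if len(samples) > slots:
        raise ValueError("more samples than available slots")
    width = len(samples[0])
    columns = []
    for j in range(width):
        column = [row[j] for row in samples]
        columns.append(column + [0] * (slots - len(column)))  # pad unused slots
    return columns
```

For example, two samples with one numeric feature, one 3-level categorical feature, and a class label yield five slot vectors: one for the numeric feature, three for the one-hot indicators, and one for the label.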
After data preparation, the researcher and all involved data providers, together with the service provider, generate the combined relinearization key. The relinearization step is not necessary for the correctness of homomorphic multiplication but is essential in our threshold-variant BFV scheme. By performing relinearization after every homomorphic multiplication, the size of the ciphertext can be strictly kept at 2, which simplifies decryption.
The relinearization key generation procedure is illustrated next. We denote the number of parties by
Each party p_{i} performs THE.encrypt(s1_{i}, pk2_{co}) and outputs k ciphertexts, of which the plaintext modulus is a group of primes whose product is the coefficient modulus in param1. The ciphertexts of secret key c_{j}(s1_{i}) (j=1,...,k) are then sent to the service provider.
The service provider computes the ciphertexts of the combined secret key c_{j}(s1_{co}) (j=1,...,k) and sends them to the data provider and researcher:
Each party p_{i} computes the ciphertexts of the product of the combined secret key and its secret key from
Here,
The service provider computes the ciphertexts of the square of the combined secret key c_{j}(s1_{co2}) (j=1,...,k) as follows:
Having encrypted the combined secret key and its square, the service provider defines the decomposition bit count
The encrypted combined relinearization key is then generated as follows: all parties perform THE.decrypt(
Secure logistic regression model training begins once all the encrypted research data and the combined relinearization key are sent to the service provider. We choose the gradient descent algorithm to train the model on homomorphically encrypted data because it can be implemented using only addition and multiplication, which all fully and somewhat homomorphic encryption schemes naturally support. Newton's method converges faster but requires matrix inversion, which may have a very high time cost under homomorphic encryption [
After choosing the proper training method, another major problem is the evaluation of the sigmoid function σ(
As the BFV scheme is based on integers, we apply a scaling factor (SF) to scale up the floating-point number
The integerized function output is then transformed into an original function:
We now describe the detailed process of secure logistic regression. Before training begins, the involved data providers divide their own research data into 10 folds from
Next, the information is encoded into a vector of values (1, 2, 3, 4, 5, 1, 6, 7, 8, 4, 8, 9, 3, 7, 10, 6, 2, 9, 10, 5). The vector can be viewed as a special column of research data, although this column is not used in the computation of the approximated sigmoid function.
When all the data providers finish dividing their research data, they send these vectors to the service provider. As these vectors do not contain any sensitive information, they do not need to be further encoded into CRT-batched polynomials and encrypted.
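A fold vector like the one above can be produced with a balanced random assignment. The sketch below is one possible way to do it, not necessarily the authors' procedure. The function names are hypothetical, and the 0/1 training mask is an assumption about how such a vector might be used to exclude the held-out fold when a model is trained.

```python
import random
from typing import List, Optional, Sequence


def assign_folds(n_samples: int, n_folds: int = 10,
                 seed: Optional[int] = None) -> List[int]:
    """Give every sample a fold number in 1..n_folds, as evenly as possible."""
    rng = random.Random(seed)
    folds = [(i % n_folds) + 1 for i in range(n_samples)]
    rng.shuffle(folds)  # random order, fold sizes unchanged
    return folds


def training_mask(folds: Sequence[int], held_out: int) -> List[int]:
    """1 for samples used to train the model that holds out one fold, else 0."""
    return [0 if f == held_out else 1 for f in folds]
```

With 20 samples and 10 folds, every fold number appears exactly twice, matching the shape of the example vector in the text.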
After all preparations are completed, the model training begins, as shown in
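[Algorithm listing: model training, researcher side. The input, output, and step expressions did not survive in this copy. Recoverable structure: an outer For loop that contains an inner For loop; inside the inner loop, a Foreach over elements, followed by "Wait for encrypted gradient calculation" and "Wait for secure decryption of encrypted gradients"; after the inner loop, an If condition that returns.]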
[Algorithm listing: model training, service provider side (3 steps). The input, output, and step expressions did not survive in this copy.]
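[Algorithm listing: model training, steps shared among the parties. The input, output, and step expressions did not survive in this copy. Recoverable structure: all data providers perform steps 1-9 (including a nested For loop), the service provider performs step 10, all parties perform step 11, and the researcher performs steps 12-19 (including a nested For loop).]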
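The training approach in this section relies on gradient descent with a sigmoid replaced by a polynomial, so that only additions and multiplications are needed. The plaintext sketch below shows that idea in isolation. It is an illustration under assumptions: the degree-3 Taylor expansion is used here only because it is a standard approximation (the approximation actually used by the protocol is not recoverable from this text), and encryption, scaling factors, masking, and cross-validation are all omitted.

```python
from typing import List, Sequence


def poly_sigmoid(z: float) -> float:
    """Degree-3 Taylor approximation of the sigmoid around 0 (assumed here)."""
    return 0.5 + z / 4.0 - z ** 3 / 48.0


def train(X: Sequence[Sequence[float]], y: Sequence[int],
          learning_rate: float = 0.1, iterations: int = 50) -> List[float]:
    """Batch gradient descent using only additions and multiplications
    inside the loop body (the division by n is a plaintext constant)."""
    n, d = len(X), len(X[0])
    beta = [0.0] * d
    for _ in range(iterations):
        grad = [0.0] * d
        for row, label in zip(X, y):
            z = sum(b * x for b, x in zip(beta, row))  # linear predictor
            err = poly_sigmoid(z) - label              # prediction error
            for j in range(d):
                grad[j] += err * row[j]
        beta = [b - learning_rate * g / n for b, g in zip(beta, grad)]
    return beta
```

The first column of `X` is expected to be a constant 1 so that `beta[0]` acts as the intercept. Because the Taylor approximation is only accurate near 0, features should be scaled so that the linear predictor stays small, which is one reason the learning rate matters in this setting.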
Once the model training is completed, all involved data providers encode their own research data for each fold into CRT-batched polynomials whose slots are randomly chosen to contain samples. In the meantime, the data providers also generate vectors containing information about whether a certain slot contains a sample and encode them into CRT-batched polynomials. For instance, for a CRT-batched polynomial containing samples in slots (1, 6, 8), the vector should be (1, 0, 0, 0, 0, 1, 0, 1). These polynomials are then encrypted by
When all the aforementioned preparations are completed, the model evaluation starts, as shown in
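[Algorithm listing: model evaluation, researcher side. The input, output, and step expressions did not survive in this copy. Recoverable structure: an outer For loop that waits for masked predictive values and contains an inner For loop; inside the inner loop, a Foreach over predictive values, followed by "Wait for masked model evaluation results" and an output step.]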
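[Algorithm listing: model evaluation, first shared listing. The input, output, and step expressions did not survive in this copy. Recoverable structure: the service provider performs steps 1-2, all data providers perform steps 3-5, the service provider performs step 6, and all parties perform step 7.]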
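[Algorithm listing: model evaluation, second shared listing. The input, output, and step expressions did not survive in this copy. Recoverable structure: all data providers perform steps 1-3, the service provider performs step 4, and all parties perform step 5.]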
Once the model evaluation ends, the researcher obtains the number of true positives (TPs), false positives (FPs), true negatives (TNs), and false negatives (FNs) for the 10 folds and different predictive value thresholds, which should be sufficient to evaluate the trained model via 10-fold cross-validation.
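To make the last step concrete, the sketch below shows how TP, FP, TN, and FN counts can be formed from slot-wise products with a slot-occupancy vector, and how per-threshold counts translate into accuracy and an area under the ROC curve. It runs on plaintext lists and is an assumption-laden illustration: the function names are hypothetical, the way 0/1 predictions are obtained for each threshold is outside this sketch, and the trapezoidal rule is a standard choice rather than something the text specifies.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def slot_mask(occupied_slots: Iterable[int], n_slots: int) -> List[int]:
    """1-based slot indices -> 0/1 vector marking slots that hold a sample."""
    occupied = set(occupied_slots)
    return [1 if i in occupied else 0 for i in range(1, n_slots + 1)]


def confusion_counts(predicted: Sequence[int], labels: Sequence[int],
                     mask: Sequence[int]) -> Tuple[int, int, int, int]:
    """TP, FP, TN, FN using only slot-wise products and sums."""
    tp = sum(m * p * l for m, p, l in zip(mask, predicted, labels))
    fp = sum(m * p * (1 - l) for m, p, l in zip(mask, predicted, labels))
    tn = sum(m * (1 - p) * (1 - l) for m, p, l in zip(mask, predicted, labels))
    fn = sum(m * (1 - p) * l for m, p, l in zip(mask, predicted, labels))
    return tp, fp, tn, fn


def rates(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, true positive rate, and false positive rate."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "tpr": tp / (tp + fn) if tp + fn else 0.0,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
    }


def auc_from_points(points: Iterable[Tuple[float, float]]) -> float:
    """Trapezoidal AUC from (FPR, TPR) points, one point per threshold."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
```

Summing the counts over all data providers for one fold and one threshold gives one (FPR, TPR) point. Repeating this over the thresholds gives the ROC curve for that fold.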
In this section, we consider the following aspects to assess the performance of our proposed multicenter secure logistic regression protocol: (1) security analysis: security of sensitive research data and learned model; (2) accuracy loss: the loss in accuracy during the model training and evaluation with respect to the nonsecure method with real medical data; (3) model training and evaluation time: the time needed to perform 10-fold cross-validation with real medical data; and (4) scalability: how the model training and evaluation time increases as the size of the data increases in the synthetic data set.
The biomedical data sets used for the experiments are shown in
Description of the data sets.
Data sets  SEER^{a} CRC^{b} data [  UCI^{c} breast cancer [ 
Samples, n  49152  277 
Attributes, n  10  9 
Size of ciphertexts, MB  60.0  18.0 
^{a}SEER: surveillance, epidemiology, and end results.
^{b}CRC: colorectal cancer.
^{c}UCI: University of California, Irvine.
To set the homomorphic encryption parameters, we select the following parameters to guarantee sufficient security, as shown in
Selected parameters for Brakerski/Fan-Vercauteren homomorphic encryption.
Parameters  param1  param2 
Degree of polynomial modulus  16,384  16,384 
Coefficient modulus  438-bit integer  300-bit integer 
Plaintext modulus  1125899904679937  Coefficient modulus of param1 
Key distribution  Uniform distribution {−1, 0, 1}  Uniform distribution {−1, 0, 1} 
Error distribution  Discrete Gaussian distribution with σ=3.2  Discrete Gaussian distribution with σ=3.2 
Security level  128-bit  192-bit 
To simulate a real-world scenario, we place the data providers, the researcher, and the service provider on different machines. For the data providers and the researcher, we use PCs with a 2.2 GHz Intel Core i7-8750H processor and 16.0 GB RAM (Windows 10 Enterprise). For the service provider, we use a server with a 2.3 GHz Intel Xeon Gold 6140 processor and 128.0 GB RAM (Linux 3.10.0). The secure logistic regression protocol is implemented in C++ using Microsoft SEAL v3.0 and is publicly available at GitHub [
In our protocol, security means that corrupted parties will not be able to obtain sensitive data or learned models from honest parties. Here, we show the security of our protocol from the following 2 aspects: (1) honest parties’ secret keys will not be obtained by the corrupted parties, so that no ciphertext will be decrypted illegally, including the encrypted data, model parameters, and any other intermediate results and (2) if the researcher is an adversary, the researcher cannot obtain any meaningful information about individuals in the honest parties’ data from the unencrypted intermediate results.
To demonstrate the security of the secret keys, we use the simulation paradigm described in the study by Goldreich [
In the generation of the combined public key,
When 
Given the ciphertext
When considering the distribution of the simulated and real views alone, the RLWE assumption is sufficient to ensure the security of secret keys of
where
First, during model training, all data providers apply one-time-use noise to mask the encrypted gradient before decryption, meaning that even if only one data provider is honest, the gradients of the individuals will not be disclosed.
Second, during model evaluation, the researcher will inevitably obtain CRT-batched polynomials containing the predictive values for each sample. Given a masked predictive value
Here,
Furthermore, because the encrypted (TP, FP, TN, and FN) information of samples under different predictive value thresholds is also masked by all data providers before being sent to the researcher, the researcher cannot obtain the label of any specific sample.
In
Accuracy comparison between nonsecure and proposed secure logistic regressions.
Data sets  SEER^{a} CRC^{b} data  Breast cancer 
AUC^{c} (nonsecure)  0.703 (0.008)  0.728 (0.156) 
AUC (our protocol)  0.696 (0.008)  0.717 (0.164) 
.09  .88  
Accuracy (nonsecure)  0.620 (0.013)  0.664 (0.149) 
Accuracy (our protocol)  0.612 (0.013)  0.632 (0.155) 
.18  .64  
0.654 (0.012)  0.508 (0.198)  
0.649 (0.012)  0.505 (0.240)  
.42  .97 
^{a}SEER: surveillance, epidemiology, and end results.
^{b}CRC: colorectal cancer.
^{c}AUC: area under the curve.
^{d}
Average receiver operating characteristic curves of nonsecure and proposed secure logistic regressions. CRC: colorectal cancer; ROC: receiver operating characteristic; SEER: surveillance, epidemiology, and end results; UCI: University of California, Irvine.
Furthermore, in
βnew–βold ÷ βnew after 99 iterations (surveillance, epidemiology, and end results colorectal cancer data).
Learning rate  0.1  0.2  0.3  0.4 
Nonsecure  0.056  0.046  0.302  0.347 
Our protocol  0.061  0.052  0.047  —^{a} 
^{a}Failed to converge.
We show the time consumption of the 10-fold cross-validation for the 2 different data sets in
Here, we compare our protocol with the SecureLR protocol by Jiang et al [
Time consumption of the proposed protocol.
Data sets  Iterations, n  Training time  Time per iteration (seconds)  Evaluation time 
SEER^{a} CRC^{b} data  45  7 min 29 seconds  9.98  20 min 27 seconds 
UCI^{c} breast cancer  45  4 min 24 seconds  5.87  14 min 28 seconds 
^{a}SEER: surveillance, epidemiology, and end results.
^{b}CRC: colorectal cancer.
^{c}UCI: University of California, Irvine.
To test our protocol’s scalability, we use a synthetic data set with different numbers of data providers and features, as shown in
Scalability of the proposed protocol for different numbers of data providers (9 features).
Data providers, n  Size of ciphertexts, MB  Iterations, n  Training time (computation)  Training time (transfer)  Evaluation time (computation)  Evaluation time (transfer) 
3  60.0  45  4 min 16 seconds  3 min 13 seconds  9 min 54 seconds  10 min 33 seconds 
5  100.0  45  6 min 26 seconds  3 min 13 seconds  15 min 24 seconds  10 min 39 seconds 
10  200.0  45  12 min 45 seconds  3 min 12 seconds  30 min 42 seconds  10 min 51 seconds 
15  300.0  45  19 min 5 seconds  3 min 13 seconds  45 min 54 seconds  11 min 3 seconds 
20  400.0  45  25 min 52 seconds  3 min 13 seconds  61 min 13 seconds  11 min 17 seconds 
Scalability of the proposed protocol for different numbers of features (3 data providers).
Features, n  Size of ciphertexts, MB  Iterations, n  Training time (computation)  Training time (transfer)  Evaluation time (computation)  Evaluation time (transfer) 
3  60.0  45  4 min 16 seconds  3 min 13 seconds  9 min 54 seconds  10 min 33 seconds 
5  100.0  45  8 min 30 seconds  6 min 23 seconds  10 min 22 seconds  10 min 53 seconds 
10  200.0  45  12 min 48 seconds  9 min 37 seconds  10 min 47 seconds  11 min 13 seconds 
15  300.0  45  16 min 54 seconds  12 min 50 seconds  11 min 16 seconds  11 min 32 seconds 
20  400.0  45  21 min 13 seconds  16 min 10 seconds  11 min 40 seconds  11 min 53 seconds 
As researchers cannot obtain unencrypted research data, they may have difficulty choosing the proper hyperparameters, especially the learning rate. Despite a slightly broader range of learning rate selection, the setting of the learning rate is still very important in our privacy-preserving multicenter logistic regression protocol because compared with the nonsecure protocol, our protocol still has a considerable time cost. In our proposed protocol, interactions exist among the service provider, the data providers, and the researcher, allowing the researcher to obtain the plaintext model parameters in every iteration. As a result, the researcher can easily judge whether the hyperparameters are set properly according to the trend of the model parameters. Moreover, the researcher can halt the model training in the early stages, which results in less waste of computational resources. However, to implement the web-based protocol, clients must be installed on all the data providers’ and researchers’ machines, which must be kept online during the entire process of model training and model evaluation, leading to an additional consumption of network bandwidth.
There is a tradeoff between computation and transfer consumption in our protocol. Although some solutions use fully homomorphic encryption to avoid decryption during model training [
Our proposed protocol has a few limitations. First, to make the privacy-preserving logistic regression practical, this protocol requires a high-speed and stable network. Second, as the BFV scheme is based on integers, before encryption, all floating-point numbers must be scaled up and rounded to integers. A larger SF can support a higher level of precision but will also result in higher computation and storage costs for a given security level. Third, in a real-world scenario, a single patient may have multiple medical records across different data providers, which rarely occurs when data providers are far apart but is not uncommon when data providers are located in the same region (eg, a city). Therefore, in the latter case, further research on privacy-preserving identification and deduplication is required to ensure that no duplicate medical records affect the analysis results. Furthermore, this study mainly focuses on technical issues and thus does not delve into matters related to ethics and law, which are also very important in multiparty medical research.
In this paper, we propose the first privacy-preserving multiparty logistic regression model training and evaluation protocol based on threshold homomorphic encryption. We conduct experiments in simulated real-life scenarios, and the results demonstrate that the proposed protocol is practical for real-world use. We believe that our work can help medical institutions eliminate privacy leakage concerns during data sharing, promote multicenter medical research, and thus improve the use of medical data to some extent.
In the future, we will extend our tools to be more practical. As the BFV homomorphic encryption scheme does not have indistinguishability under chosen ciphertext attack security, additional security technology, such as hashing, should be integrated into the tools to prevent malicious attackers from tampering with the ciphertexts. More privacy-preserving statistics and machine learning methods will be added to our tools to considerably enhance flexibility in secure multicenter research. Furthermore, we will improve the efficiency of our tools using graphics processing unit or field-programmable gate array acceleration.
Details of used biomedical data, details of Brakerski/Fan-Vercauteren (BFV) threshold homomorphic encryption, security analysis, and noise analysis.
BFV: Brakerski/Fan-Vercauteren
CKKS: Cheon-Kim-Kim-Song
CRT: Chinese remainder theorem
FN: false negative
FP: false positive
NTT: number theoretic transform
RLWE: ring learning with errors
SF: scaling factor
SGX: software guard extensions
SIMD: single instruction, multiple data
TN: true negative
TP: true positive
This work was supported by the National Natural Science Foundation of China (under Grant 81771936 and 81801796), the Major Scientific Project of Zhejiang Laboratory (under Grant 2018DG0ZX01), the National Key Research and Development Program of China (under Grant 2018YFC0116901), and the Fundamental Research Funds for the Central Universities, China (No. 2020QNA5031).
The study concept and design were given by YL and TZ. Implementation and experiments of the study were carried out by YL. Drafting of the manuscript was carried out by YL and YT. Discussion, critical revision, and final approval of the version to be published were performed by JL, SZ, YT, TZ, and YL.
None declared.