Privacy-Preserving Methods for Feature Engineering Using Blockchain: Review, Evaluation, and Proof of Concept

Background The protection of private data is a key responsibility for research studies that collect identifiable information from study participants. Limiting the scope of data collection and preventing secondary use of the data are effective strategies for managing these risks. An ideal framework for data collection would incorporate feature engineering, a process where secondary features are derived from sensitive raw data in a secure environment without a trusted third party.
Objective This study aimed to compare current approaches based on how they maintain data privacy and the practicality of their implementations. These approaches include traditional approaches that rely on trusted third parties, as well as cryptographic, secure hardware, and blockchain-based techniques.
Methods A set of properties was defined for evaluating each approach. A qualitative comparison was presented based on these properties. The evaluation of each approach was framed with a use case of sharing geolocation data for biomedical research.
Results We found that approaches that rely on a trusted third party for preserving participant privacy do not provide sufficiently strong guarantees that sensitive data will not be exposed in modern data ecosystems. Cryptographic techniques incorporate strong privacy-preserving paradigms but are appropriate only for select use cases or are currently limited because of computational complexity. Blockchain smart contracts alone are insufficient to provide data privacy because transactional data are public. Trusted execution environments (TEEs) may have hardware vulnerabilities and lack visibility into how data are processed. Hybrid approaches combining blockchain and cryptographic techniques or blockchain and TEEs provide promising frameworks for privacy preservation. As a supplement, we provide a reference software implementation in which users can privately share features of their geolocation data using the hybrid approach that combines blockchain with TEEs.
Conclusions Blockchain technology and smart contracts enable the development of new privacy-preserving feature engineering methods by obviating dependence on trusted parties and providing immutable, auditable data processing workflows. The intersections of blockchain with cryptographic techniques and of blockchain with secure hardware technologies are promising fields for addressing important data privacy needs. Hybrid blockchain and TEE frameworks currently provide practical tools for implementing experimental privacy-preserving applications.


Proxy Re-encryption (PRE)
The primary application of PRE has been secure distributed data storage, and research in the field has focused on improving security and performance [22], revocable access control through key rotation [23], and the ability to re-encrypt data multiple times using fully homomorphic encryption [15]. Recent implementations have bolstered the technique with distributed consensus networks (e.g., blockchain) to decrease the trust required of any single proxy [24].
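To make the re-encryption step concrete, the following is a minimal sketch of a textbook ElGamal-style bidirectional scheme (in the spirit of Blaze, Bleumer, and Strauss), not a scheme taken from the works cited above. The group parameters `P`, `Q`, `G` and all function names are assumptions chosen for illustration, and the tiny modulus offers no real security.

```python
import secrets

# Toy group (assumption): P = 2Q + 1 is a safe prime and G generates the
# subgroup of prime order Q. Real deployments use far larger parameters.
P, Q, G = 2039, 1019, 4


def keygen():
    """Return (secret key, public key)."""
    sk = secrets.randbelow(Q - 1) + 1  # 1 <= sk <= Q - 1
    return sk, pow(G, sk, P)


def encrypt(pk, m):
    """Encrypt an integer m with 1 <= m < P under public key pk."""
    r = secrets.randbelow(Q - 1) + 1
    return (m * pow(G, r, P)) % P, pow(pk, r, P)


def rekey(sk_from, sk_to):
    """Re-encryption key sk_to / sk_from mod Q.

    In this simple sketch both secret keys are needed to form the key.
    """
    return (sk_to * pow(sk_from, -1, Q)) % Q


def reencrypt(rk, ciphertext):
    """Proxy step: transforms the ciphertext without seeing the plaintext."""
    c1, c2 = ciphertext
    return c1, pow(c2, rk, P)


def decrypt(sk, ciphertext):
    c1, c2 = ciphertext
    g_r = pow(c2, pow(sk, -1, Q), P)    # recover G^r
    return (c1 * pow(g_r, -1, P)) % P   # divide the mask out
```

In this sketch the data owner encrypts once under their own key, and the proxy only ever handles the re-encryption key and ciphertexts. Modular inverses via `pow(x, -1, n)` require Python 3.8 or later.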

Secure Multiparty Computation (MPC)
A few practical applications and pilot projects that incorporate MPC include [21]:
1. Secured databases, which can incorporate query re-writing over encrypted data stored on relational databases, or additive secret sharing over distributed databases for data analytics
2. Access key management
3. Statistical computations over private data
The security of most MPC protocols is characterized by a security model which can tolerate a certain number of dishonest "adversaries". For example, SPDZ is a popular MPC protocol that can tolerate up to all but one party being dishonest [40].
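The additive secret sharing mentioned above can be illustrated with a minimal sketch of a private sum. This is a semi-honest toy, not SPDZ: SPDZ additionally uses authentication tags and a preprocessing phase to tolerate actively dishonest parties. The modulus and function names are assumptions made for illustration.

```python
import secrets

# Assumed public modulus; the true sum must be smaller than this value.
MODULUS = 2**61 - 1


def share(value, n_parties):
    """Split value into n_parties random shares that sum to value mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def reconstruct(shares):
    """Recombine shares; any strict subset of them reveals nothing useful."""
    return sum(shares) % MODULUS


def secure_sum(private_inputs):
    """Simulate n parties computing the sum of their inputs."""
    n = len(private_inputs)
    # Each participant splits its input and sends one share to every party.
    distributed = [share(x, n) for x in private_inputs]
    # Party j adds the j-th share of every input; it never sees a raw input.
    partial_sums = [sum(row[j] for row in distributed) % MODULUS for j in range(n)]
    # Only the partial sums are published and combined.
    return reconstruct(partial_sums)
```

For example, `secure_sum([12, 7, 30])` yields the total of the three inputs while each simulated party only handles uniformly random-looking shares.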

Homomorphic Encryption (HE)
There are a handful of examples where HE has been applied to specific use cases for feature extraction when the encrypted data vector is an interesting feature itself. For example, deep neural networks have been applied directly to encrypted data in use cases such as biometric authentication [16] and optical character recognition [17].
FHE has also been used in conjunction with MPC to hide individual input data while computing an aggregate result in the encrypted space [40]. A shared decryption step then reveals the result.
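The idea of computing an aggregate in the encrypted space can be sketched with the Paillier cryptosystem. Note the assumptions: Paillier is only additively homomorphic (it is not FHE), the primes below are toy values with no security, and a single key holder decrypts here rather than the shared decryption step described above.

```python
import math
import secrets

# Toy key material (assumption): real keys use primes of 1024+ bits each.
P, Q = 293, 433
N = P * Q
N_SQ = N * N
G = N + 1
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(P - 1, Q - 1)
MU = pow(LAM, -1, N)  # valid because G = N + 1


def encrypt(m):
    """Encrypt an integer 0 <= m < N with fresh randomness."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return (pow(G, m, N_SQ) * pow(r, N, N_SQ)) % N_SQ


def add_encrypted(c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts mod N."""
    return (c1 * c2) % N_SQ


def decrypt(c):
    u = pow(c, LAM, N_SQ)
    return ((u - 1) // N * MU) % N


def encrypted_sum(ciphertexts):
    """Aggregate ciphertexts without decrypting any individual input."""
    total = encrypt(0)
    for c in ciphertexts:
        total = add_encrypted(total, c)
    return total
```

An aggregator can call `encrypted_sum` on ciphertexts submitted by participants and forward only the result for decryption, so individual inputs are never decrypted.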
Since computational complexity is one of the main drawbacks for FHE, speeding up computations is an active research area, and some promising improvements have been achieved using GPUs to parallelize computations [20].

Zero-Knowledge Proof (ZKP)
Zero-knowledge proofs of knowledge are particularly applicable in problems dealing with identity and authentication.
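As an illustration of a proof of knowledge used for authentication, the following is a minimal sketch of the well-known Schnorr identification protocol, in which a prover shows knowledge of a secret key without revealing it. It is not drawn from the works cited in this section; the toy group parameters and function names are assumptions, and the protocol in this interactive form is zero-knowledge only against an honest verifier.

```python
import secrets

# Toy group (assumption): P = 2Q + 1 is a safe prime and G generates the
# subgroup of prime order Q. Real parameters are far larger.
P, Q, G = 2039, 1019, 4


def keygen():
    """Prover's long-term secret x and public key y = G^x mod P."""
    x = secrets.randbelow(Q - 1) + 1
    return x, pow(G, x, P)


def commit():
    """Prover picks a one-time nonce k and sends t = G^k mod P."""
    k = secrets.randbelow(Q)
    return k, pow(G, k, P)


def challenge():
    """Verifier replies with a random challenge c."""
    return secrets.randbelow(Q)


def respond(x, k, c):
    """Prover answers with s = k + c*x mod Q; the nonce k masks x."""
    return (k + c * x) % Q


def verify(y, t, c, s):
    """Verifier accepts if G^s == t * y^c mod P."""
    return pow(G, s, P) == (t * pow(y, c, P)) % P
```

A full run consists of `commit`, `challenge`, `respond`, and `verify` in that order. The nonce `k` must never be reused, since two responses that share a nonce reveal the secret.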
Blum, Feldman, and Micali developed a scheme in which zero-knowledge is obtainable without interaction by sharing a common reference string between a Prover and a Verifier [43]. A later development, termed zk-SNARKs, has been incorporated into the verification of blockchain transactions, as explained in the corresponding section.
In general, problems that require a statement of fact (e.g., some number N exists in the set of composite numbers) could be addressed using a ZKP technique through the following steps: (i) representing the problem as a Boolean circuit that is satisfied if and only if the prover knows the correct input, (ii) translating the circuit into a graph problem, and (iii) solving the graph problem using the Goldreich, Micali, and Wigderson (GMW) protocol [19,37]. However, these steps are so technically involved that the approach is infeasible for most applications.

Trusted Execution Environment (TEE)
A few pilot projects incorporating TEEs include securing cryptocurrency wallets [29], running queries on genetic data [28], and blockchain-based cloud computing platforms [30-32]. Since TEEs support running compiled software programs, it would also be possible to develop a feature extraction routine for GPS data that could be run on a TEE.
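A feature extraction routine of the kind suggested above might look like the following minimal sketch. The chosen features (point count, total distance, and farthest distance from the starting point) and all names are hypothetical examples, not the features used in the reference implementation. The sketch shows only the computation; packaging it to run inside an enclave depends on the TEE toolchain and is out of scope here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def extract_features(track):
    """Derive coarse features from a time-ordered list of (lat, lon) tuples.

    Only the returned summary would leave the secure environment; the raw
    coordinates would stay inside it.
    """
    total = sum(haversine_km(*a, *b) for a, b in zip(track, track[1:]))
    farthest = max((haversine_km(*track[0], *p) for p in track), default=0.0)
    return {
        "num_points": len(track),
        "total_distance_km": total,
        "max_distance_from_start_km": farthest,
    }
```

Returning aggregate values rather than coordinates reflects the goal of sharing derived features instead of raw location traces, although even coarse features can leak information and would need review for a given study.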

Private and Consortium blockchains
Several implementations that incorporate private blockchains address health data sharing needs such as access to research study data [41] or data integrity and tamper resistance [42]. The two most popular private blockchain platforms are forked Ethereum blockchains (adapted to run separately from the public, main Ethereum network) and Hyperledger.

Privacy-preserving blockchains incorporating TEEs
The Enigma platform [30,36] combines private off-chain data stored on a peer-to-peer distributed hash table, multiparty computation on Intel SGX processors, and a public blockchain that holds references to the data. The Ekiden / Oasis protocol [31,32] also uses a hybrid blockchain and TEE model, and aims to support multiple TEEs, including Intel SGX and the open-source Keystone project [25]. Both projects indicate that research is underway to incorporate cryptographic techniques such as ZKPs and MPC into their protocols.