P4P: A Practical Framework for Large-Scale Privacy-Preserving Distributed Computation
Overview

Imagine a scenario in which a large group of users want to mine their collective data. This could be a community of movie fans extracting recommendations from their ratings, or a social network voting for its favorite members. These, and many other existing and emerging applications and services (such as e-voting, e-commerce, e-bidding, and data mining), call for collaborative computation on shared user data. In all these cases, the users may wish not to reveal their private data, not even to a "trusted" service provider, yet still obtain verifiably accurate results. Failure to provide adequate privacy protection may hinder the acceptance of such applications.

The major challenges in augmenting such applications with privacy are the scale of the problem and the need to deal with cheating users. Typically the quality of the result increases with the size of the data (both the size of the user group and the dimensionality of per-user data). Nowadays it is common for commercial service providers to run algorithms on data sets collected from thousands or even millions of users. For example, the well-publicized Netflix Prize data set consists of roughly 100 million ratings of 17,770 movies contributed by 480,000 users. At such a scale, existing solutions for both private computation and verifying proper behavior suffer from prohibitive cost and are not practical. In other words, privacy technologies have failed to catch up with data mining algorithms' appetite and processing capability for large data sets.

We strive to change this. Our goal is to provide a privacy solution that is practical for many (but not all) real-world applications at reasonably large scale. We introduce a framework called Peers for Privacy (P4P), which is guided by the natural incentives of users and vendors and by today's computing reality.
On a typical computer today there is a six-orders-of-magnitude gap between the cryptographic operations in the large fields needed for secure homomorphic computation (on the order of milliseconds) and regular arithmetic operations in small (32- or 64-bit) fields (a fraction of a nanosecond). Existing privacy solutions make heavy use of public-key operations for information hiding or verification. While they have the same asymptotic complexity as the standard algorithms for those problems, the constant factors imposed by public-key operations are prohibitive for large-scale systems. In contrast, P4P's main computation is based on verifiable secret sharing (VSS) over a small field. This allows private arithmetic operations to have the same cost as regular, non-private arithmetic, since both manipulate numbers of the same size with similar complexity. Moreover, such a paradigm admits extremely efficient zero-knowledge (ZK) tools that are practical even at large scale. Such tools are indispensable in dealing with cheating participants.
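To make the small-field idea concrete, here is a minimal sketch of additive secret sharing modulo 2^32 and private addition on shares. This illustrates only the arithmetic core; the "verifiable" part of VSS (commitments that let shares be checked) is omitted, and the function names are ours, not the P4P toolkit's API.

```python
import secrets

FIELD_BITS = 32          # small field matching the machine word size
MODULUS = 1 << FIELD_BITS

def share(value, n=2):
    """Split `value` into n additive shares mod 2^32.
    Any n-1 shares are uniformly random and reveal nothing."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def reconstruct(shares):
    return sum(shares) % MODULUS

# Private addition: each holder adds the shares it has locally;
# combining the per-holder sums reveals only the total.
a_shares = share(7)
b_shares = share(35)
sum_shares = [(x + y) % MODULUS for x, y in zip(a_shares, b_shares)]
assert reconstruct(sum_shares) == 42
```

Every operation here is a single machine-word addition, which is why the private computation costs essentially the same as the non-private one.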
Why the Name?

The name has two meanings:
The Power of Addition

P4P gains its practicality by decomposing data mining algorithms into a sequence of vector addition steps that can be privately evaluated using a new verifiable secret sharing (VSS) scheme over a small field (e.g., 32 or 64 bits), which has the same cost as regular, non-private arithmetic. Addition-based computations are more general than they might appear, and support a large number of popular data mining and machine learning algorithms: linear algorithms such as summation and voting; non-linear ones such as SVD, PCA, k-means, ID3, and machine learning algorithms based on Expectation Maximization (EM); and all algorithms in the statistical query model. The non-linear algorithms aggregate data only in certain steps, such as conjugate gradient, which are linear in the data. Moreover, addition is extremely easy to parallelize, so aggregating a large quantity of numbers is straightforward.
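As a hypothetical illustration of this decomposition (our own example, not toolkit code), one k-means iteration reduces to vector additions: each user computes per-cluster sums and counts from its private points, and the only cross-user step is adding those vectors, which is exactly what P4P evaluates on secret shares.

```python
import numpy as np

def local_update(points, centers):
    """Each user computes per-cluster sums and counts from its own data.
    These vectors are all the aggregator needs, and they combine by
    pure addition across users."""
    k, d = centers.shape
    sums = np.zeros((k, d))
    counts = np.zeros(k)
    for p in points:
        c = int(np.argmin(((centers - p) ** 2).sum(axis=1)))
        sums[c] += p
        counts[c] += 1
    return sums, counts

# Three users, each holding private 2-D points.
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
users = [
    np.array([[0.0, 0.0], [0.0, 1.0]]),
    np.array([[10.0, 10.0]]),
    np.array([[9.0, 10.0], [1.0, 0.0]]),
]

# The aggregation step is just vector addition across users.
updates = [local_update(u, centers) for u in users]
total_sums = sum(s for s, _ in updates)
total_counts = sum(c for _, c in updates)
new_centers = total_sums / total_counts[:, None]
```

Only `total_sums` and `total_counts` ever leave the users, and in P4P even those are revealed only as aggregates over secret shares.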
Verifying Proper Behavior

In a realistic application, users are generally not trustworthy. Some may be incentivized to bias the computation; some may have their machines compromised. The system must have effective and efficient mechanisms to handle active cheating by a small number of users. The P4P framework supports a number of very efficient zero-knowledge protocols that verify, in zero-knowledge (i.e., without leaking additional information about user data), a few properties of user input. These include an L2-norm ZKP that ensures that the L2-norm of the vector a user inputs into the computation is bounded by a predefined limit L. This prevents a malicious user from exerting too much influence on the computation. The protocol uses only a logarithmic number of large-field (1024 bits or more) operations. In experiments with our implementation, verifying a million-element vector (e.g., during an SVD calculation) takes a few seconds of server or client time on commodity PCs, as shown by the following figure (in contrast, using standard techniques takes hours). Other ZKPs include equality and consistency tests, which use a constant number of large-field operations. Overall, the protocol is dominated by the (linear) time to do small-field operations, a cost one has to pay even when the computation is done directly on user data without privacy. This makes privacy protection almost free from the vendor's point of view, which is essential for wide adoption.

Figure 1: (a) Verifier and (b) prover times in seconds for the validation protocol where (from top to bottom) L (the required bound) has 40, 20, or 10 bits. The x-axis is the vector length.
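The statistical idea behind checking an L2-norm bound can be sketched without the cryptography: project the vector onto random ±1 challenge vectors, since E[(c·v)^2] = ||v||^2, so a handful of squared projections estimates the norm. This toy check is our own plaintext sketch, not the ZK protocol itself; in the real protocol the prover answers such challenges over commitments, with zero-knowledge proofs that the responses are well-formed.

```python
import random

random.seed(1)

def projection_norm_check(v, L, trials=50):
    """Toy (non-private) norm-bound check: average the squared
    projections of v onto random +/-1 vectors, which estimates
    ||v||^2, and compare against the bound L^2 with some slack
    for estimation error."""
    est = 0.0
    for _ in range(trials):
        c = [random.choice((-1, 1)) for _ in v]
        proj = sum(ci * vi for ci, vi in zip(c, v))
        est += proj * proj
    return est / trials <= 2 * L * L   # factor 2: slack for sampling noise

honest = [1] * 100        # squared norm 100, well under the bound
cheater = [100] * 100     # squared norm 1,000,000, far over it
assert projection_norm_check(honest, L=20)
assert not projection_norm_check(cheater, L=20)
```

The point of the bound is influence limiting: a user whose vector passes can shift any aggregate by at most L in Euclidean distance, no matter how the cheating is arranged.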
The P4P Toolkit

P4P also provides a set of tools that can be used to compose diverse applications, including a data protection scheme, secure multicast, and more. Altogether, P4P provides a comprehensive and realistic solution for preserving privacy in real-world applications. The basic P4P tools are being actively developed, and most components are ready. If you are interested in testing out the code, please send me an email: oldsky AT gmail DOT com.
(Update: Yitao has graduated and is working at NetEase Youdao R&D, Beijing, on a new Chinese search engine. But he is still maintaining the code. Feel free to drop him a message.)

John Canny, Computer Science Division, University of California, Berkeley