P4P: A Practical Framework for Large-Scale Privacy-Preserving Distributed Computation

Overview

Imagine a scenario where a large group of users wants to mine their collective data. This could be a community of movie fans extracting recommendations from their ratings, or a social network voting for its favorite members. These, and many other existing and emerging applications and services (e-voting, e-commerce, e-bidding, data mining, etc.), call for collaborative computation on shared user data. In all these cases, the users may wish not to reveal their private data, not even to a "trusted" service provider, yet still obtain verifiably accurate results. Failure to provide adequate privacy protection may hinder the acceptance of such applications.

The major challenges in augmenting such applications with privacy are the scale of the problem and the need to deal with cheating users. Typically the quality of the result increases with the size of the data (both the size of the user group and the dimensionality of per-user data). Nowadays it is common for commercial service providers to run algorithms on data sets collected from thousands or even millions of users. For example, the well-publicized Netflix Prize data set consists of roughly 100M ratings of 17,770 movies contributed by 480K users. At such a scale, existing solutions for both private computation and verifying proper behavior suffer from prohibitive cost and are not practical. In other words, privacy technologies have failed to keep up with data mining algorithms' appetite for, and capability to process, large data sets.

We strive to change this. Our goal is to provide a privacy solution that is practical for many (but not all) real-world applications at reasonably large scale. We introduce a framework called Peers for Privacy (P4P), which is guided by the natural incentives of users and vendors and by today's computing reality. On a typical computer today there is a six-orders-of-magnitude gap between the cryptographic operations in a large field needed for secure homomorphic computation (on the order of milliseconds) and regular arithmetic operations in small (32- or 64-bit) fields (a fraction of a nanosecond). Existing privacy solutions make heavy use of public-key operations for information hiding or verification. While they have the same asymptotic complexity as the standard algorithms for those problems, the constant factors imposed by public-key operations are prohibitive for large-scale systems. In contrast, P4P's main computation is based on verifiable secret sharing (VSS) over a small field. This allows private arithmetic operations to have the same cost as regular, non-private arithmetic, since both manipulate numbers of the same size with similar complexity. Moreover, such a paradigm admits extremely efficient zero-knowledge (ZK) tools that are practical even at large scale. Such tools are indispensable for dealing with cheating participants.
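To make the cost argument concrete, here is a minimal Python sketch, an illustration only and not the P4P toolkit itself, of additive secret sharing of a single value over a word-sized prime field. Both sharing and reconstruction are ordinary small-integer arithmetic, which is exactly why the private computation costs roughly the same as computing on the data in the clear.

    # Minimal sketch (illustration, not the P4P toolkit API): additive secret
    # sharing of one value over a word-sized prime field.  Every operation is
    # ordinary small-integer arithmetic.
    import secrets

    PRIME = 2**61 - 1  # a prime that fits comfortably in a 64-bit machine word


    def share(value):
        """Split `value` into two random-looking additive shares mod PRIME."""
        peer_share = secrets.randbelow(PRIME)         # held by a privacy peer
        server_share = (value - peer_share) % PRIME   # held by the server
        return server_share, peer_share


    def reconstruct(server_share, peer_share):
        """Recombine the shares; either share alone is uniformly random."""
        return (server_share + peer_share) % PRIME


    assert reconstruct(*share(42)) == 42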

Why the Name?

The name has two meanings:

  • Peer Participation: P4P involves the participation of agents acting on behalf of users' privacy. We call them Privacy Peers. In some situations such agents can be selected from among the users themselves. Most users want better privacy, but don't understand the risks and generally are not willing to pay much for better protection. On the other hand, user communities always contain a fraction of altruistic users who provide resources to the rest of the community - this is how most peer-to-peer computing actually works. P4P can tap into resources from a few of these users, which allows a very simple, efficient solution that places few or no additional demands on either the server or the other users.

  • Peer Protection: During the computation, certain aggregate information is released. This is a very important technique that gives the private protocol its high efficiency. We show that publishing such aggregate information does not harm privacy: individual traits are masked out in the aggregates, so releasing them is safe. In other words, peers' data mutually protect each other within the aggregates.

The Power of Addition

P4P gains its practicality by decomposing data mining algorithms into a sequence of vector addition steps that can be privately evaluated using a new verifiable secret sharing (VSS) scheme over a small field (e.g., 32 or 64 bits), which has the same cost as regular, non-private arithmetic. Addition-based computation is more general than it may appear: it supports a large number of popular data mining and machine learning algorithms, including linear algorithms such as summation and voting, as well as non-linear ones such as SVD, PCA, k-means, ID3, and machine learning algorithms based on Expectation Maximization (EM), along with all algorithms in the statistical query model. The non-linear algorithms aggregate data only in certain steps, such as conjugate gradient, that are linear in the data. Moreover, addition is extremely easy to parallelize, so aggregating a large number of values is straightforward.
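The following sketch illustrates the addition-only aggregation pattern (the names and structure here are assumptions for readability, not the P4P toolkit API). Each user splits its data vector into two random additive shares, one for the server and one for a privacy peer; each aggregator simply adds the share vectors it receives, and the two running totals combine to the exact sum without either party seeing any individual's vector.

    # Illustrative sketch of addition-only aggregation with two aggregators.
    import secrets

    PRIME = 2**61 - 1  # a small prime field that fits in a machine word


    def share_vector(v):
        """Split a user's vector into two random additive share vectors mod PRIME."""
        peer_share = [secrets.randbelow(PRIME) for _ in v]
        server_share = [(x - r) % PRIME for x, r in zip(v, peer_share)]
        return server_share, peer_share


    def aggregate(share_vectors):
        """An aggregator's whole job: add up the share vectors it received."""
        total = [0] * len(share_vectors[0])
        for s in share_vectors:
            total = [(t + x) % PRIME for t, x in zip(total, s)]
        return total


    # Three users contribute vectors; the server and the privacy peer each see
    # only random-looking shares, yet the combined aggregates equal the true sum.
    users = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    server_shares, peer_shares = zip(*(share_vector(u) for u in users))
    result = [(a + b) % PRIME
              for a, b in zip(aggregate(server_shares), aggregate(peer_shares))]
    assert result == [111, 222, 333]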

Verifying Proper Behavior

In a realistic application, users are generally not trustworthy. Some may be incentivized to bias the computation; some may have their machines corrupted. The system must have effective and efficient mechanisms to handle active cheating by a small number of users. The P4P framework supports a number of very efficient zero-knowledge protocols that verify, in zero knowledge (i.e., without leaking additional information about user data), a few properties of user input. These include an L2-norm ZKP which ensures that the L2-norm of the vector a user inputs into the computation is bounded by a predefined limit L. This prevents a malicious user from exerting too much influence on the computation. The protocol uses only a logarithmic number of large-field (1024 bits or more) operations. In experiments with our implementation, we show that verification of a million-element vector (e.g., during an SVD calculation) takes a few seconds of server or client time on commodity PCs, as shown by the following figure (in contrast, using standard techniques takes hours). Other ZKPs include equality and consistency tests, which use a constant number of large-field operations. Overall, the protocol is dominated by the (linear) time to do small-field operations, a cost one has to pay even when the computation is done directly on user data without privacy. This makes privacy protection almost free from the vendor's point of view, which is essential for wide adoption.

Figure 1: (a) Verifier and (b) prover times in seconds for the validation protocol where (from top to bottom) L (the required bound) has 40, 20, or 10 bits. The x-axis is the vector length.
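For intuition, the sketch below shows, in the clear, the random-projection idea behind the L2-norm check. The actual protocol performs the check in zero knowledge over commitments to the user's shares; the challenge count and slack factor used here are assumptions chosen for readability, not the constants from the paper.

    # Plaintext illustration of a random-projection norm check (probabilistic).
    import random


    def passes_norm_check(d, L, num_challenges=50, slack=2.0):
        """Estimate |d|^2 from random +/-1 projections and compare it to L^2."""
        sq_sum = 0
        for _ in range(num_challenges):
            c = [random.choice((-1, 1)) for _ in d]    # public random challenge
            x = sum(ci * di for ci, di in zip(c, d))   # projection of d onto c
            sq_sum += x * x
        # For +/-1 challenges, E[x^2] = |d|^2, so sq_sum / num_challenges is an
        # unbiased estimate of the squared norm of the user's vector.
        return sq_sum / num_challenges <= slack * L * L


    honest = [1] * 1000        # squared norm 1,000, well below L^2 = 10,000
    cheater = [10_000] * 1000  # squared norm 10^11, far above the bound
    assert passes_norm_check(honest, L=100)
    assert not passes_norm_check(cheater, L=100)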

The P4P Toolkit

P4P also provides a set of tools that can be used to compose diverse applications, including a data protection scheme, secure multicast, and others. Together, these components make P4P a comprehensive and realistic solution for preserving privacy in real-world applications.

The basic P4P tools are being actively developed. Most components are ready. If you are interested in testing out the code, please send me an email: oldsky AT gmail DOT com.

People

    Yitao Duan
    (Update: Yitao has graduated and is working at NetEase Youdao R&D, Beijing, on a new Chinese search engine. But he is still maintaining the code. Feel free to drop him a message.)

    John Canny, Computer Science Division, University of California, Berkeley

Talks

  • CIKM 2009 Poster (November, 2009) (.ppt)
  • SDM 2008 (4/24/2008) (.ppt)
  • Dissertation Talk (11/27/2007) (.ppt)
  • PODC 2007 (8/12/2007) (.ppt)
  • INFOCOM 2007 (5/09/2007) (.ppt)
  • CT-RSA 2006 (2/16/2006) (.ppt)

Publications

  • Yitao Duan, John Canny and Justin Zhan. P4P: Practical Large-Scale Privacy-Preserving Distributed Computation Robust against Malicious Users. In The 19th USENIX Security Symposium, August 11-13, 2010, Washington, D.C. (.pdf)

  • Yitao Duan. Privacy without Noise. In The 18th ACM Conference on Information and Knowledge Management (CIKM) 2009. November 2-6, 2009. Hong Kong, China. Full version can be found here.

  • Yitao Duan and John Canny. How to Deal with Malicious Users in Privacy-Preserving Distributed Data Mining. Statistical Analysis and Data Mining, Volume 2, Number 1, 2009, Pages 18-33.

  • Yitao Duan and John Canny. Practical Private Computation and Zero-Knowledge Tools for Privacy-Preserving Distributed Data Mining. In SIAM International Conference on Data Mining (SDM08). April 24-26, 2008. Atlanta, Georgia, USA. (.pdf)

  • Yitao Duan and John F. Canny. Brief Announcement: Practical Private Computation of Vector Addition-Based Functions. PODC 2007: 326-327, August 12-15, Portland, OR, USA. Full version appears in SDM 08 and can be found here.

  • Yitao Duan and John Canny. Scalable Secure Bidirectional Group Communication. INFOCOM 2007, 6-12 May 2007, Anchorage, Alaska, USA. (.pdf)

  • Yitao Duan, John Canny and Justin Zhan. Efficient Privacy-Preserving Association Rule Mining: P4P Style. In IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2007), April 1-5, 2007, Honolulu, Hawaii.

  • Yitao Duan and John Canny. From Commodity to Value: A Privacy-Preserving e-Business Architecture. In 2006 IEEE International Conference on e-Business Engineering (ICEBE 2006), Oct. 24 - 26, Shanghai, China. (.pdf. Full version here)

  • Yitao Duan and John Canny. Zero-knowledge Test of Vector Equivalence and Granulation of User Data with Privacy. In 2006 IEEE International Conference on Granular Computing (GrC 2006), May 10 - 12, Atlanta, USA. (.pdf)

  • Yitao Duan and John Canny. Practical Private Computation of Vector Addition-Based Functions or: Can Privacy be for Free? UC Berkeley Technical Report No. UCB/EECS-2006-12 (.pdf)

  • Yitao Duan and John Canny. How to Construct Multicast Cryptosystems Provably Secure against Adaptive Chosen Ciphertext Attack. In CT-RSA 2006. February 13 - 17, 2006, McEnery Convention Center, San Jose, USA. (.pdf) Presentation slides can be found here.

  • Yitao Duan, Jingtao Wang, Matthew Kam, John Canny. Privacy Preserving Link Analysis on Dynamic Weighted Graph, Computational & Mathematical Organization Theory, Volume 11, Issue 2, Jul 2005, Pages 141 - 159.

  • Yitao Duan, Jingtao Wang, Matthew Kam and John Canny. A Secure Online Algorithm for Link Analysis on Weighted Graph, SIAM Workshop on Link Analysis, Counterterrorism and Security, (at the SIAM Int. Conf. on Data Mining), Sutton Place Hotel, Newport Beach, California, USA, 23rd April, 2005. (.pdf)

  • Yitao Duan and John Canny. Protecting User Data in Ubiquitous Computing: Towards Trustworthy Environments. Privacy-Enhancing Technologies (PET) 2004, Toronto, CA, May 2004. (.pdf)

  • Ben Y. Zhao, Yitao Duan, Ling Huang, Anthony D. Joseph, and John D. Kubiatowicz, Brocade: Landmark Routing on Overlay Networks, IPTPS'02. (.pdf)