Keywords: Differential Privacy, Machine Learning
This GitHub repository contains a supplemental package of Python Jupyter notebooks for the Initial Public Draft of NIST Special Publication 800-226, Guidelines for Evaluating Differential Privacy Guarantees, that illustrate how to achieve differential privacy and other concepts described in the publication.
Notes: The Initial Public Draft of NIST Special Publication 800-226, Guidelines for Evaluating Differential Privacy Guarantees, is out for public comment through 11:59 p.m. EST on Thursday, January 25, 2024. Submit comments to privacyeng [at] nist.gov (privacyeng[at]nist[dot]gov). Commenters are encouraged to use this comment template.
Affiliation/Organization(s) Contributing: NIST, University of Vermont, Galois
GitHub POC: @davdar
NIST SP 800-226 Supplemental Material on GitHub Share Feedback
Keywords: Differential Privacy, Homomorphic Encryption, Machine Learning, Federated Learning
The goal of the U.S.-U.K. PETs Prize Challenges was to advance privacy-preserving federated learning solutions that provide end-to-end privacy and security protections while harnessing the potential of artificial intelligence (AI) for overcoming significant global challenges, such as financial crime prevention and pandemic forecasting and prevention. The winning solutions combined different PETs to allow the AI models to learn to make better predictions without exposing any sensitive data. The challenges took place from July 2022 through March 2023. U.S. solutions are listed in alphabetical order. Information on U.K. winners can be found here.
Name of Tool: HyFL framework for financial anomaly detection
ILLIDAN Lab won third place in Phase 2, Track A: Financial Crime Prevention of the Challenge. Their repository provides the HyFL framework, a tool for detecting anomalies in financial transactions. The framework supports a hybrid federated learning paradigm that offers secure, privacy-aware learning and inference for financial anomaly detection.
Team Members: Jiayu Zhou, Haobo Zhang, Junyuan Hong, Steve Drew, Fan Dong
Affiliation/Organization(s) Contributing:
HyFL-Framework on GitHub Share Feedback
MusCAT won second place in Phase 2, Track B: Pandemic Forecasting and Response of the Challenge (and first place for the Phase 1 white paper). MusCAT is a multi-scale, hybrid federated system for privacy-preserving epidemic surveillance and risk prediction. It combines differential privacy, multiparty homomorphic encryption, and federated learning to jointly analyze private data held by multiple federation units with formal privacy guarantees.
Team Members: Hyunghoon Cho, David Froelicher, Denis Loginov, Seth Neel, David Wu, Yun William Yu
Affiliation/Organization(s) Contributing: Broad Institute, MIT, Harvard Business School, UT Austin, University of Toronto
MusCAT on GitHub Share Feedback
PPMLHuskies won second place in Phase 2, Track A: Financial Crime Prevention of the Challenge. PPMLHuskies proposes a cross-silo federated architecture in which a payment network system (PNS), denoted by S, has labeled data to train a model M for detection of anomalous payments. The other entities in the federation are banks B_1, B_2, ..., B_n that collaborate with S to create feature values that improve the utility of M. To jointly extract feature values in a privacy-preserving manner, S and the banks engage in cryptographic protocols to perform computations over their joint data without ever disclosing their data to each other in unencrypted form; i.e., PPMLHuskies's solution provides input privacy through encryption, with mathematically verifiable guarantees. To the best of the team's knowledge, such joint privacy-preserving feature extraction in a federation with horizontally and vertically partitioned data is novel.
Furthermore, to prevent the model from memorizing instances from the training data, the model is trained with a machine learning (ML) algorithm that provides Differential Privacy (DP). PPMLHuskies's overall solution therefore provides both input privacy, as none of the entities in the federation ever sees the data of any other entity in unencrypted form, and output privacy, as the model and any inferences made with it limit leakage about the underlying training data in accordance with DP guarantees.
For the privacy-preserving feature extraction, PPMLHuskies proposes a custom protocol based on elliptic-curve ElGamal and oblivious key-value stores (OKVS). The model is a neural network trained with DP-SGD. PPMLHuskies proves that the overall solution is secure in the honest-but-curious setting. Experimental results demonstrate that the solution is efficient and scalable, and that it yields accurate models while preserving input and output privacy.
Team Members: Martine De Cock, Anderson Nascimento, Sikha Pentyala, Steven Golob, Dean Kelley, Zekeriya Erkin, Jelle Vos, Célio Porsius Martins, Ricardo Maia
Affiliation/Organization(s) Contributing: University of Washington Tacoma, Universidade de Brasilia, TU Delft
PPMLHuskies on GitHub Share Feedback
puffle won first place in Phase 2, Track B: Pandemic Forecasting and Response of the Challenge. Their solution is a simple, general, and easy-to-use multi-task learning (MTL) framework that balances the interplay between privacy, utility, and data heterogeneity in private cross-silo federated learning. puffle's framework involves three key components: (1) model personalization for capturing data heterogeneity across data silos, (2) local noisy gradient descent for silo-specific, node-level differential privacy in contact graphs, and (3) model mean-regularization to balance privacy-heterogeneity trade-offs and minimize the loss of accuracy. Combined, the framework provides differential privacy with flexible data granularity and improved privacy-utility trade-offs, adapts readily to gradient-based learning algorithms, and is simple to implement and tune. puffle's solution is based in part on the team's NeurIPS'22 paper studying privacy and personalization in cross-silo federated learning.
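Under the usual reading of mean-regularized multi-task learning, each silo fits its own model while a penalty term pulls it toward the federation average; the notation below is illustrative and not taken from the team's repository, with lambda controlling the privacy-heterogeneity trade-off:

```latex
% Illustrative mean-regularized MTL objective: silo k trains a personal model w_k
% (e.g., with local noisy gradient descent) while being pulled toward the mean model.
\min_{w_1,\ldots,w_K} \; \sum_{k=1}^{K} \left( F_k(w_k) + \frac{\lambda}{2}\,\bigl\lVert w_k - \bar{w} \bigr\rVert_2^2 \right),
\qquad \bar{w} = \frac{1}{K} \sum_{k=1}^{K} w_k
```

Read this way, lambda = 0 recovers purely local training and a very large lambda approaches a single shared model, which is how the framework can interpolate between personalization and sharing.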
Team Members: Ken Ziyu Liu, Shengyuan Hu, Tian Li, Steven Wu, Virginia Smith
Affiliation/Organization(s) Contributing: Carnegie Mellon University, School of Computer Science
puffle on GitHub Share Feedback
Name of Tool: Anomaly Detection via Privacy-Enhanced Two-Step Federated Learning
Scarlet-PETs won first place in Phase 2, Track A: Financial Crime Prevention of the Challenge. Scarlet-PETs developed a novel privacy-preserving (PP) two-step federated learning approach to identify anomalous financial transactions. In the first step, PP feature mining is performed on the banks' account-level data, and the resulting features are augmented to the payment network's data using a PP encoding scheme. In the second step, a classifier is learned by the messaging network from the augmented data. A key benefit of the approach is that performance in the federated setting is comparable to performance in the centralized setting, with no significant drop in accuracy. Furthermore, the approach is extremely flexible, since it allows the messaging network to adapt its model and features to build a better classifier without imposing any additional computational or privacy burden on the banks.
Team Members: Hafiz Asif, Sitao Min, Xinyue Wang, Jaideep Vaidya
Affiliation/Organization(s) Contributing: Rutgers University
Scarlet-PETs on GitHub Share Feedback
Visa Research won special recognition in Phase 2, Track A: Financial Crime Prevention of the Challenge. The solution folder consists of three components: a centralized solution, a federated solution, and the source code for an oblivious transfer shared library used in Visa-Pets-FL's federated solution.
Team Members: Sebastian Meiser, Andrew Beams, Hao Yang, Yuhang Wu, Panagiotis Chatzigiannis, Srinivasan Raghuraman, Sunpreet Singh Arora, Harshal Shah, Yizhen Wang, Karan Patel, Peter Rindal, Mahdi Zamani
Visa-Pets-FL on GitHub Share Feedback
Keywords: Differential Privacy, Machine Learning
This work presents a novel algorithm called Approximate Minima Perturbation (AMP) for differentially private convex optimization, and an extensive empirical evaluation on real datasets of both AMP and a number of previous approaches for solving this problem. The GitHub repository contains Python implementations of AMP, noisy stochastic gradient descent, noisy Frank-Wolfe, objective perturbation, and two variants of output perturbation, as well as a number of benchmarks for generating experimental results.
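As a rough illustration of the simplest of these baselines, output perturbation, the sketch below trains a regularized binary classifier and adds Gaussian noise to its weights. It is not the repository's code; the sensitivity formula assumes features scaled to unit norm, a 1-Lipschitz loss, and lam-strong convexity, and the helper name is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def output_perturbation(X, y, epsilon, delta, lam=0.1):
    """Sketch of output perturbation: solve a regularized ERM problem, then add
    Gaussian noise to the learned weights.  Under the stated assumptions the
    L2 sensitivity of the minimizer is 2 / (n * lam)."""
    n = X.shape[0]
    clf = LogisticRegression(C=1.0 / (lam * n)).fit(X, y)      # regularized ERM
    w = clf.coef_.ravel()
    sensitivity = 2.0 / (n * lam)
    sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    return w + np.random.normal(0.0, sigma, size=w.shape)      # (epsilon, delta)-DP weights
```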
Notes: The AMP algorithm and associated experimental results are described in an IEEE Symposium on Security and Privacy 2019 paper available here.
Affiliation/Organization(s) Contributing: Carnegie Mellon University; Boston University; University of California, Berkeley; University of California, Santa Cruz; Peking University
GitHub POC: @jnear
Keywords: Differential Privacy, K-Anonymity, Anonymization, Machine Learning
ARX is a comprehensive open source software for anonymizing sensitive personal data. It supports a wide variety of (1) privacy and risk models, (2) methods for transforming data and (3) methods for analyzing the usefulness of output data.
Affiliation/Organization(s) Contributing: TUM - Technical University of Munich
GitHub POC: @prasser
Keywords: Differential Privacy
Chorus is a tool for answering SQL queries with differential privacy. Chorus works with a standard SQL database, and scales to large datasets by offloading the heavy lifting of query answering to the database. To implement differential privacy mechanisms, Chorus uses a combination of query rewriting and post-processing.
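A toy illustration of the rewrite-then-post-process pattern (plain sqlite3 plus a Laplace step in Python; this is not Chorus's own API, and the table and predicate names are placeholders):

```python
import sqlite3
import numpy as np

def noisy_count(conn, table, predicate, epsilon):
    """Answer a counting query with the standard database engine, then add
    Laplace noise in post-processing (a counting query has sensitivity 1)."""
    cur = conn.execute(f"SELECT COUNT(*) FROM {table} WHERE {predicate}")
    true_count = cur.fetchone()[0]
    return true_count + np.random.laplace(scale=1.0 / epsilon)

# Hypothetical usage:
# conn = sqlite3.connect("people.db")
# print(noisy_count(conn, "patients", "age >= 65", epsilon=0.5))
```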
Notes: Chorus is described in a EuroS&P paper available here.
Affiliation/Organization(s) Contributing: University of Vermont, University of California Berkeley
GitHub POC: @jnear
Chorus on GitHub Share Feedback
Keywords: Differential Privacy, Synthetic Data Generation
Participants in Match #3 of NIST's 2018 Public Safety Communications Research Differential Privacy Synthetic Data Challenge developed these open source algorithms as part of an effort to advance differential privacy. Participants were challenged to create new methods, or improve existing methods of data de-identification, while preserving the dataset’s utility for analysis. All solutions were required to satisfy the differential privacy guarantee, a provable guarantee of individual privacy protection. Participants used a data set of emergency response events occurring in San Francisco and a sub-sample of the IPUMS USA data for the 1940 U.S. Census. Contributions are listed in alphabetical order.
This repo contains an implementation of the award-winning solution to the 2018 Differential Privacy Synthetic Data Challenge by team UCLANESL. The solution won 5th place in Match #3 of the challenge, and an earlier version won 4th place in Match #1. It trains a Wasserstein generative adversarial network (WGAN) on the real private dataset; differentially private training is achieved by sanitizing (norm-clipping and adding Gaussian noise to) the gradients of the discriminator. Once the model is trained, it can be used to generate a synthetic dataset by feeding random noise into the generator.
Team Members: Prof. Mani Srivastava (@msrivastava) - Team Captain (Match 1 and Match 3), Moustafa Alzantot (@malzantot) - (Match 1 and Match 3), Nat Snyder (@natsnyder1) - Match 1, Supriyo Charkaborty (@supriyogit) - Match 1
DP_WGAN-UCLANESL on GitHub More Information Share Feedback
This is the fourth place entry in the third round of the NIST Differential Privacy Synthetic Data Challenge. The goal of this challenge is to produce differentially private synthetic data while retaining as much useful information as possible about the original data set. Colorado census data from 1940 with 98 field columns were provided for algorithm development with census data from other states used for testing. This solution groups together fields which have been found to be highly correlated. For each of these groups, a histogram is created for the purpose of counting the number of occurrences of every possible combination of values of all fields in the group. For privatization, Laplacian noise is added to every bin with scale proportional to the number of groups / total epsilon. Synthetic data is generated by selecting a random bin for each group with probability weighted by these noisy bin counts. The field values corresponding to each group's selected bin are written out as a single row of synthetic data.
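A compact sketch of that histogram mechanism for a single group of correlated fields appears below. It is illustrative only: the column names are placeholders, epsilon_for_group stands for the total epsilon divided by the number of groups, and for brevity the histogram covers observed value combinations rather than every possible one.

```python
import numpy as np
import pandas as pd

def synthesize_group(df, group_cols, epsilon_for_group, n_rows):
    """Count each combination of values in the group, add Laplace noise to
    every bin, then sample synthetic rows in proportion to the noisy counts."""
    counts = df.groupby(group_cols).size()
    noisy = counts.to_numpy() + np.random.laplace(
        scale=1.0 / epsilon_for_group, size=len(counts))
    noisy = np.clip(noisy, 0, None)                  # negative counts make no sense
    probs = noisy / noisy.sum()
    picks = np.random.choice(len(counts), size=n_rows, p=probs)
    rows = [counts.index[i] for i in picks]          # field values of each chosen bin
    return pd.DataFrame(rows, columns=group_cols)
```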
Team Member & Affiliation: John Gardner (no affiliation)
DPFieldGroups on GitHub Share Feedback
DPSyn is an algorithm for synthesizing microdata while satisfying differential privacy. The repository provides its instantiation for the dataset used in the competition, namely the Public Use Microdata Sample (PUMS) of the 1940 USA Census Data.
Team Members & Affiliations: Ninghui Li (Purdue University), Zhikun Zhang (Zhejiang University), Tianhao Wang (Purdue University)
The first place entry in the third round of the NIST Differential Privacy Synthetic Data Challenge. The high-level idea is to (1) use the Gaussian mechanism to obtain noisy answers to a carefully selected set of counting queries (1-, 2-, and 3-way marginals) and (2) find a synthetic data set that approximates the true data with respect to those queries. The second step is accomplished with [3], and the first step uses ideas inspired by [1] and [2]: the mutual information of each pair of attributes is calculated on the public dataset, and the marginal queries with high mutual information are selected. (A sketch of the measurement step follows the references below.)
[1] Zhang, Jun, et al. "PrivBayes: Private Data Release via Bayesian Networks." ACM Transactions on Database Systems (TODS) 42.4 (2017): 25.
[2] Chen, Rui, et al. "Differentially Private High-Dimensional Data Publication via Sampling-Based Inference." Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015.
[3] McKenna, Ryan, Daniel Sheldon, and Gerome Miklau. "Graphical-Model Based Estimation and Inference for Differential Privacy." Proceedings of the 36th International Conference on Machine Learning. 2019.
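A minimal sketch of the measurement step (Gaussian noise added to a selected low-way marginal) is shown here; attribute names and the noise scale are placeholders, and the synthesis step via [3] is omitted:

```python
import numpy as np
import pandas as pd

def noisy_marginal(df, cols, sigma):
    """Measure one marginal with the Gaussian mechanism: tabulate exact counts
    for every observed value combination of `cols`, then add N(0, sigma^2) noise."""
    counts = pd.crosstab(index=[df[c] for c in cols[:-1]], columns=df[cols[-1]])
    return counts + np.random.normal(0.0, sigma, size=counts.shape)

# A full run would measure many such marginals (chosen by mutual information on
# public data) and then fit a synthetic dataset to the noisy measurements,
# e.g. with the graphical-model approach of [3].
```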
Team Member & Affiliation: Ryan McKenna (UMass Amherst)
rmckenna Algorithm on GitHub Share Feedback
Keywords: Differential Privacy, Machine Learning
Train machine learning models with differential privacy by clipping and noising gradients during stochastic gradient descent.
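A bare-bones sketch of one DP-SGD step is shown below (pure NumPy; the clipping norm, noise multiplier, and gradient function are placeholders, and a real implementation would also track the cumulative privacy loss with the moments accountant described in the paper):

```python
import numpy as np

def dp_sgd_step(w, X_batch, y_batch, grad_fn, lr=0.1, clip=1.0, noise_mult=1.1):
    """One DP-SGD update: clip each per-example gradient to L2 norm `clip`,
    sum the clipped gradients, add Gaussian noise calibrated to the clip norm,
    average, and take an ordinary gradient step."""
    clipped = []
    for x, y in zip(X_batch, y_batch):
        g = grad_fn(w, x, y)                                    # per-example gradient
        g = g * min(1.0, clip / (np.linalg.norm(g) + 1e-12))    # norm clipping
        clipped.append(g)
    noise = np.random.normal(0.0, noise_mult * clip, size=w.shape)
    g_private = (np.sum(clipped, axis=0) + noise) / len(X_batch)
    return w - lr * g_private
```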
Notes: Paper with full details: https://arxiv.org/abs/1607.00133
Affiliation/Organization(s) Contributing: Google
GitHub POC: @ilyamironov
DP-SGD on GitHub Share Feedback
Keywords: Differential Privacy, Machine Learning, Data Analytics
Diffprivlib is a general-purpose Python library for experimenting with, and building tools for, differential privacy. Diffprivlib includes a number of off-the-shelf algorithms for machine learning and data analytics with differential privacy, in the familiar scikit-learn and NumPy syntax.
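A short usage sketch in that scikit-learn style is shown below. It follows the library's documented interface from memory, so parameter names and values should be checked against the current diffprivlib documentation; the bounds and epsilon values are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from diffprivlib.models import GaussianNB
from diffprivlib import tools as dp_tools

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Differentially private naive Bayes; bounds give the per-feature data range so
# no privacy budget is spent estimating it from the data.
clf = GaussianNB(epsilon=1.0, bounds=(X.min(axis=0), X.max(axis=0)))
clf.fit(X_train, y_train)
print("DP accuracy:", clf.score(X_test, y_test))

# A differentially private summary statistic in NumPy-like syntax.
print("DP mean of first feature:", dp_tools.mean(X[:, 0], epsilon=0.5, bounds=(4, 8)))
```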
Notes: Introductory whitepaper
Affiliation/Organization(s) Contributing: IBM Research
GitHub POC and Email: @naoise-h | naoise [at] ibm.com (naoise[at]ibm[dot]com)
Diffprivlib on GitHub Share Feedback
Keywords: Differential Privacy, Verification of Algorithms, Machine Learning
Duet is a programming language that automatically derives (and checks) differential privacy bounds for programs written in the language. Duet is designed to support modern machine learning algorithms and advanced variants of differential privacy, so that minimal noise needs to be added to algorithm results in order to ensure privacy.
Affiliation/Organization(s) Contributing: University of Vermont, University of California at Berkeley, University of Utah
GitHub POC: @jnear
Keywords: Differential Privacy
Ektelo is a programming framework and system that aids programmers in developing differentially private programs with high utility. Ektelo can be used to author programs for a variety of statistical tasks that involve answering counting queries over a table of arbitrary dimension.
Notes: Ektelo is described in detail in a SIGMOD 2018 paper, titled "EKTELO: A Framework for Defining Differentially-Private Computations." https://dl.acm.org/citation.cfm?id=3196921
Affiliation/Organization(s) Contributing: UMass Amherst, Duke University, Colgate University
GitHub POC: @michaelghay
Keywords: Differential Privacy
Google's differential privacy library provides a set of building block components that allow developers to build differentially private applications in C++, Java, and Go. Furthermore, Google's DP library offers 'Privacy on Beam', an end-to-end implementation of differential privacy that helps developers perform operations in a differentially private manner. This codelab gives further insight.
Affiliation/Organization(s) Contributing: Google
Email POC: dp-open-source [at] google.com (dp-open-source[at]google[dot]com)
Google Differential Privacy on GitHub Share Feedback
Keywords: Differential Privacy, Machine Learning, Database Queries
The tool provides differential privacy guarantees for statistical/machine learning algorithms by treating the underlying algorithm as a black box and relying only on its input/output signatures. It implements a variant of the celebrated sample-and-aggregate framework of Nissim, Raskhodnikova, and Smith (2007). The empirical evaluation shows that the system scores well on various learning tasks (such as clustering and regression).
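A toy rendering of the sample-and-aggregate idea for a simple statistic is sketched below; it is a conceptual illustration, not GUPT's API, and the block count and output bounds are placeholders.

```python
import numpy as np

def sample_and_aggregate_mean(data, epsilon, n_blocks=20, lower=0.0, upper=1.0):
    """Sample-and-aggregate: run the (black-box) analysis on disjoint blocks,
    then privately aggregate the per-block outputs.  Each record influences
    only one block, so changing one record shifts the aggregate by at most
    (upper - lower) / n_blocks."""
    blocks = np.array_split(np.random.permutation(data), n_blocks)
    block_outputs = np.clip([np.mean(b) for b in blocks], lower, upper)
    sensitivity = (upper - lower) / n_blocks
    return np.mean(block_outputs) + np.random.laplace(scale=sensitivity / epsilon)
```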
Notes: GUPT is described in detail in a SIGMOD 2012 paper, titled "GUPT: Privacy Preserving Data Analysis Made Easy." A PDF is available here.
Affiliation/Organization(s) Contributing: University of California, Berkeley; University of California, Santa Cruz; Cornell University
GitHub POC: @prashmohan
Keywords: Differential Privacy, Verification of Algorithms, Machine Learning, Adversarial Examples
Adversarial examples that fool prediction models are a new class of attacks introduced by machine learning deployments. PixelDP is the first certified defense that both offers provable guarantees of robustness against these attacks and scales to large models and datasets, such as Google’s Inception on the ImageNet dataset. PixelDP's design relies on a novel use of differential privacy at prediction time.
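Conceptually, a PixelDP-style prediction can be sketched as below (framework-agnostic NumPy pseudocode; `model`, the noise placement, and the parameter values are stand-ins for the actual architecture, which injects noise at a chosen layer with bounded sensitivity):

```python
import numpy as np

def pixeldp_predict(model, x, sigma=0.3, n_draws=25):
    """Prediction-time randomization in the PixelDP spirit: add Gaussian noise
    to the input, average the class scores over several draws, and return the
    averaged scores, whose stability under small input perturbations is what
    the differential privacy argument certifies."""
    scores = []
    for _ in range(n_draws):
        noisy_x = x + np.random.normal(0.0, sigma, size=x.shape)
        scores.append(model(noisy_x))        # model(x) returns class probabilities
    return np.mean(scores, axis=0)
```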
Notes: This IEEE S&P 2019 research paper describes PixelDP.
Affiliation/Organization(s) Contributing: Columbia University
GitHub POC: @matlecu
PixelDP on GitHub Share Feedback
Keywords: K-Anonymity, Anonymization, Information Leakage, Algorithmic Fairness, Database Queries, Location Data
The Privacy Protection Application de-identifies databases that contain sequential geolocation data, sometimes called moving object databases. A record of a personally-owned vehicle’s route of travel is an example, but the tool can process other types of geolocation sequences. The application has a graphical user interface and operates on Linux, OS X, and Windows. Location suppression is the de-identification strategy used, and decisions about which locations to suppress are based on information theory. This strategy does not modify the precision of retained location information. One of the objectives is to produce data usable for vehicle safety analysis and transportation application development.
Notes: This tool processes static databases and has two versions. The main GUI version uses a very efficient map-matching strategy that may identify false roads for certain types of road structures. The tagged version (https://github.com/usdot-its-jpo-data-portal/privacy-protection-application/releases/tag/hmm-mm) uses a Hidden Markov Model map-matching algorithm that is more accurate but less efficient; this version is a command line tool that runs in Docker. Additionally, a streaming de-identification tool was developed for a USDOT Safety Pilot Study. This tool uses geofencing to identify locations that can be retained. It can also be found on GitHub: https://github.com/usdot-jpo-ode/jpo-cvdp
POC: carterjm [at] ornl.gov (carterjm[at]ornl[dot]gov)
Keywords: Differential Privacy, Machine Learning
The PATE (Private Aggregation of Teacher Ensembles) framework achieves differentially private learning by carefully coordinating the activity of several different ML models: an ensemble of "teacher" models trained on disjoint partitions of the sensitive data supervises a "student" model through noisy aggregation of the teachers' predictions.
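The core aggregation step can be sketched in a few lines (the teacher models and the noise parameter are placeholders; the full framework also trains a student model on these noisy labels and accounts for the total privacy spent):

```python
import numpy as np

def pate_label(teachers, x, num_classes, gamma=0.1):
    """Noisy-max aggregation from PATE: each teacher (trained on a disjoint
    partition of the private data) votes for a class, Laplace noise is added
    to the vote counts, and the noisy argmax becomes the label released to
    the student."""
    votes = np.zeros(num_classes)
    for teacher in teachers:
        votes[teacher(x)] += 1                       # teacher(x) returns a class index
    noisy_votes = votes + np.random.laplace(scale=1.0 / gamma, size=num_classes)
    return int(np.argmax(noisy_votes))
```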
Notes: Papers with full details: https://arxiv.org/abs/1802.08908
Affiliation/Organization(s) Contributing: Google
GitHub POC: @npapernot
PATE Framework on GitHub Share Feedback