
PETs Testbed

The PETs Testbed provides the capability to investigate privacy-enhancing technologies (PETs) and their suitability for specific use cases.

The testbed consists of a set of example problems drawn from real-world use cases. Each problem is accompanied by a model solution that demonstrates a PETs implementation, documents the privacy risks of the implementation, and provides fidelity, utility, and privacy metrology to evaluate the implementation. The testbed provides fully implemented PETs architectures connected to data and hardware resources for official collaborators. The testbed also maintains a set of repositories that allow the public to reproduce solutions offline and ‘play at home.’

Model Problem 1: Prediction of phenotype by genotype across data silos

In collaboration with the U.S. Census Bureau XD Team and offered through the National Cybersecurity Center of Excellence, NIST plans to develop a privacy-preserving federated learning environment as a model system to conduct cyber and privacy risk analysis.

Motivation:

Human genomic data are being collected and stored in data silos such as the Gene Expression Omnibus (GEO) and the NCI Genomic Data Commons (GDC). Analyzing genomic data across disparate silos can lead to improved research outcomes; for example, working across silos may be necessary to study rare cancers. Yet aggregating information from multiple silos requires overcoming regulatory, data, security, and privacy issues. Privacy-preserving federated learning (PPFL) enables machine learning algorithms to train across silos while limiting flows of personally identifiable information by adding calibrated noise to the model weights (e.g., differential privacy) and by encrypting computations (e.g., secure multi-party computation). PPFL is a promising architecture for balancing machine learning data needs with privacy concerns. This project seeks to build PPFL model systems, frameworks, and assessment tools to help organizations evaluate the risks and benefits of PPFL systems.

Testbed component: Privacy-Preserving Federated Learning (PPFL) Model Environment for Genomics Data

Hosted by the National Cybersecurity Center of Excellence, this component investigates the trade-offs of a distributed learning system deployed across silos of phenotype and genotype data.

The testbed simulates several nodes (data silos), each of which contains phenotype and genotype data for a unique set of individuals. The task is to create a final output model that uses single nucleotide polymorphism (SNP) data to predict physical traits. Initial data resources draw on plant and other non-human genetic data to limit actual privacy risks to people. Privacy-preserving elements, such as differential privacy and multiparty computation, will be built into the distributed learning routine to limit information that can be used to conduct reconstruction, membership inference, and other privacy attacks.
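To make the privacy-preserving element concrete, the sketch below shows one round of differentially private federated averaging across simulated silos. It is a minimal illustration, not the testbed's implementation: the logistic model, the clipping norm, and the noise multiplier are all assumptions, and a deployed system would layer secure aggregation (multiparty computation) on top.

```python
# A minimal sketch (assumptions: logistic model, numpy only, illustrative
# hyperparameters) of one round of differentially private federated
# averaging across simulated data silos. Not the testbed's implementation.
import numpy as np

rng = np.random.default_rng(0)

def local_update(weights, X, y, lr=0.1):
    """One gradient step of logistic regression on a silo's SNP feature
    matrix X and binary phenotype labels y; returns the weight delta."""
    preds = 1.0 / (1.0 + np.exp(-(X @ weights)))
    grad = X.T @ (preds - y) / len(y)
    return -lr * grad

def dp_federated_round(weights, silos, clip_norm=1.0, noise_multiplier=1.0):
    """Clip each silo's update to bound its sensitivity, average, then add
    Gaussian noise calibrated to the clip norm (DP-FedAvg style)."""
    updates = []
    for X, y in silos:
        delta = local_update(weights, X, y)
        delta *= min(1.0, clip_norm / (np.linalg.norm(delta) + 1e-12))
        updates.append(delta)
    avg = np.mean(updates, axis=0)
    sigma = noise_multiplier * clip_norm / len(silos)
    return weights + avg + rng.normal(0.0, sigma, size=avg.shape)

# Toy setup: 3 silos, 20 unique individuals each, 5 SNP features.
silos = [(rng.normal(size=(20, 5)), rng.integers(0, 2, 20).astype(float))
         for _ in range(3)]
w = np.zeros(5)
for _ in range(50):
    w = dp_federated_round(w, silos)
```

The design point the sketch captures is that clipping bounds any one silo's influence on the average, which is what allows the added noise to provide a differential privacy guarantee.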

PPFL output models will be evaluated using a set of fidelity and utility metrics. Privacy risks will be evaluated using a set of empirical privacy evaluation metrics and further tested through a series of red-teaming exercises.
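As one example of an empirical privacy evaluation metric of the kind described above, the sketch below measures a model's vulnerability to a simple loss-threshold membership inference attack. The attack style (Yeom-style thresholding) and all names are illustrative assumptions, not the testbed's evaluation suite.

```python
# A minimal sketch of one empirical privacy metric: a loss-threshold
# membership inference test. All names and the threshold choice are
# illustrative assumptions, not the testbed's evaluation suite.
import numpy as np

def logistic_loss(weights, X, y):
    """Per-example cross-entropy loss of a logistic model."""
    p = 1.0 / (1.0 + np.exp(-(X @ weights)))
    eps = 1e-12
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def membership_advantage(weights, members, non_members):
    """Guess 'member' when an example's loss falls below the median loss
    over both groups; return advantage = TPR - FPR (0 means this attack
    measures no leakage)."""
    loss_in = logistic_loss(weights, *members)
    loss_out = logistic_loss(weights, *non_members)
    threshold = np.median(np.concatenate([loss_in, loss_out]))
    return np.mean(loss_in < threshold) - np.mean(loss_out < threshold)
```

An advantage near zero suggests the attacker cannot distinguish training members from held-out individuals; larger values indicate leakage that stronger noise or tighter clipping should reduce.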

Testbed outputs:

The testbed will develop a set of model PPFL architectures, fully implemented in a cloud environment. Each architecture will be subjected to a privacy threat evaluation using tools such as fidelity and utility metrics, the NIST Privacy Framework, a suite of empirical privacy metrics, and the outcomes of a privacy red-teaming exercise. Together these tools will serve as a framework for helping organizations navigate the trade-offs in a PPFL system.

Collaborations:

By the end of 2024, NIST expects to have a fully functional demonstration PPFL architecture and a suite of tools to evaluate model utility and privacy risks. We will be inviting collaborators to develop PPFL algorithms and architectures, contribute to our evaluation metrics, and participate in our red-teaming exercises. 

Send an email to ppfl [at] list.nist.gov to receive project updates.

Model Problem 2: Tabular data deidentification of demographic records

Motivation:

Tabular demographic data (e.g., surveys) are essential to administering government programs (e.g., representation apportionment, welfare spending, and infrastructure planning). These data are invaluable for policy making and research purposes. Government agencies with mandates to collect this type of data are typically required both to publish the data and to maintain the confidentiality of the records. Deidentification techniques, such as those using synthetic data and differential privacy, can help organizations navigate the tension between publishing data and protecting the privacy of individuals.

Testbed component: The Collaborative Research Cycle 

The Collaborative Research Cycle (CRC) investigates the effect of deidentification algorithms on the fidelity, utility, and privacy of privatized data. In this problem, participants use any approach they choose to reduce the privacy risks of the NIST Diverse Community Data Excerpts, a subset of the American Community Survey. Participants submit their privatized data along with an abstract describing how the data were generated. We use the SDNist software package to evaluate the data. The privatized data, the abstract, and the evaluation results are then archived in a human- and machine-readable repository built for meta-analysis of deidentification techniques. We are currently working on incorporating a more expansive set of empirical privacy evaluation tools into SDNist to better evaluate the privacy risks of the privatized data.
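For intuition about what this evaluation measures, the sketch below computes a simple two-way-marginal fidelity score between a target table and a deidentified instance, in the spirit of SDNist's k-marginal metric. It is an illustrative approximation, not SDNist's actual code or API; the function name and column handling are assumptions.

```python
# An illustrative fidelity check in the spirit of SDNist's k-marginal
# score: compare all 2-way marginal distributions between the target
# table and a deidentified instance. A sketch, not SDNist's actual code.
import itertools
import pandas as pd

def two_marginal_score(target: pd.DataFrame, deid: pd.DataFrame) -> float:
    """Average (1 - total variation distance) over the 2-way marginals
    of every pair of columns shared by the two tables."""
    cols = [c for c in target.columns if c in deid.columns]
    scores = []
    for a, b in itertools.combinations(cols, 2):
        t = target.groupby([a, b]).size() / len(target)
        d = deid.groupby([a, b]).size() / len(deid)
        tvd = t.subtract(d, fill_value=0).abs().sum() / 2
        scores.append(1.0 - tvd)
    return sum(scores) / len(scores)
```

A score near 1 means the deidentified data preserves the joint distributions of column pairs; lower scores indicate fidelity loss.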

Testbed outputs:

The CRC has an archive of nearly 500 deidentified data instances (and we’re still accepting more!), each accompanied by a detailed abstract on the generation methods and by detailed evaluation data. The program also provides tools that make it easy to parse and navigate the archive. The CRC has hosted a series of workshops that gather the community to investigate deidentification.

Send an email to CRC+subscribe [at] list.NIST.gov to join the CRC listserv for future updates.

Created August 26, 2024, Updated September 3, 2024