The PETs Testbed enables investigation of privacy-enhancing technologies (PETs) and their suitability for specific use cases.
The Testbed consists of example problems drawn from real-world use cases. Each problem is accompanied by a model solution that demonstrates a PETs implementation, real (or realistic) benchmark data, and metrology to evaluate the success of the PETs deployment. We implement the PETs solution on NIST hardware, which is available to official collaborators. The Testbed also maintains a set of repositories that allow the public to reproduce solutions offline and 'play at home.'
NIST plans to use these resources to support guidelines for PETs deployment and evaluation.
Through the National Cybersecurity Center of Excellence, NIST is developing a privacy-preserving federated learning environment as a model system to conduct cyber and privacy risk analysis.
Human genomic data is being collected and stored in data silos such as the Gene Expression Omnibus (GEO) and the NCI Genomic Data Commons (GDC). Analyzing genomic data across disparate silos can lead to improved research outcomes. For example, working across silos may be necessary to study rare cancers. Yet, aggregating information from multiple silos requires overcoming regulatory, data, security, and privacy issues. Privacy-preserving federated learning (PPFL) enables machine learning algorithms to train across silos while limiting personally identifiable information flows by adding calibrated noise to the model weights (e.g., differential privacy) and encrypting computations (e.g., multi-party computation). PPFL is a promising architecture to balance machine learning data needs with privacy concerns. This project seeks to build PPFL model systems, frameworks, and assessment tools to help organizations evaluate the risks and benefits of PPFL systems.
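As a minimal illustration of the noise-addition side of PPFL, the sketch below runs rounds of federated averaging in which each silo's local model update is norm-clipped and perturbed with Gaussian noise before the server aggregates it. All function names and parameters are our own for illustration, not part of the Testbed codebase, and the secure-computation (multi-party computation) step is omitted.

```python
import numpy as np

def local_update(w, X, y, lr=0.1):
    # One gradient step of least-squares regression on a silo's local data.
    grad = X.T @ (X @ w - y) / len(y)
    return w - lr * grad

def dp_federated_round(w, silos, clip=1.0, noise_std=0.01, rng=None):
    """One round of federated averaging with a Gaussian-mechanism step:
    each silo computes a local model update, clips its norm, and adds
    calibrated noise before the server averages the updates."""
    rng = rng if rng is not None else np.random.default_rng(0)
    updates = []
    for X, y in silos:
        delta = local_update(w, X, y) - w
        delta *= min(1.0, clip / (np.linalg.norm(delta) + 1e-12))  # clip norm
        updates.append(delta + rng.normal(0.0, noise_std * clip, delta.shape))
    return w + np.mean(updates, axis=0)

# Toy usage: two silos holding disjoint samples of the same linear task.
rng = np.random.default_rng(42)
true_w = np.array([1.0, -2.0])
silos = [(X, X @ true_w) for X in (rng.normal(size=(50, 2)) for _ in range(2))]
w = np.zeros(2)
for _ in range(100):
    w = dp_federated_round(w, silos, rng=rng)
```

With small noise the aggregated model still approaches the true weights; raising `noise_std` strengthens the privacy protection at the cost of accuracy, which is exactly the trade-off the Testbed is built to measure.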
Hosted by the National Cybersecurity Center of Excellence, this component investigates the tradeoffs of a distributed learning system deployed across silos of phenotype and genotype data.
The Testbed simulates several nodes (data silos), each of which contains phenotype and genotype data for a unique set of individuals. The task is to create a final output model that uses single nucleotide polymorphism (SNP) data to predict physical traits. Initial data resources are drawn from plant and other non-human genomic data to limit actual privacy risks to people. Privacy-preserving elements, such as differential privacy and multi-party computation, will be built into the distributed learning routine to limit information that can be used to conduct reconstruction, membership inference, and other privacy attacks.
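For concreteness, here is one way the per-node prediction task could look. This is a hypothetical sketch, not the Testbed's actual pipeline: genotypes at each SNP are encoded as minor-allele counts (0/1/2), and a ridge regression maps the resulting matrix to a quantitative trait.

```python
import numpy as np

# Minor-allele-count encoding for a biallelic SNP (illustrative alphabet).
GENOTYPE_CODE = {"AA": 0, "Aa": 1, "aA": 1, "aa": 2}

def encode_snps(genotypes):
    """Encode per-individual genotype strings into a numeric matrix of
    minor-allele counts: one row per individual, one column per SNP."""
    return np.array([[GENOTYPE_CODE[g] for g in row] for row in genotypes],
                    dtype=float)

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X^T X + lam I)^{-1} X^T y.
    Regularization matters here because SNP matrices typically have far
    more columns (SNPs) than rows (individuals)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

In a federated setting each node would run the encoding locally and contribute only (noised) model updates, never raw genotypes.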
PPFL output models will be evaluated using a set of fidelity and utility metrics. Privacy risks will be evaluated using a set of empirical privacy evaluation metrics and further tested through a series of red-teaming exercises.
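As an example of what an empirical privacy evaluation can look like, the sketch below implements a loss-threshold membership inference attack, a common baseline metric. It is illustrative only; the Testbed's actual metric suite may differ, and all names here are our own.

```python
import numpy as np

def loss_threshold_mia(model_loss, members, nonmembers):
    """Empirical privacy metric sketch: a loss-threshold membership
    inference attack. Records whose loss falls below the pooled median
    are guessed to be training members. Returns balanced attack accuracy:
    ~0.5 means no measurable leakage, 1.0 means full leakage."""
    member_losses = np.asarray(model_loss(members), dtype=float)
    nonmember_losses = np.asarray(model_loss(nonmembers), dtype=float)
    threshold = np.median(np.concatenate([member_losses, nonmember_losses]))
    true_pos = np.mean(member_losses < threshold)      # members flagged
    true_neg = np.mean(nonmember_losses >= threshold)  # nonmembers cleared
    return 0.5 * (true_pos + true_neg)

# Toy usage with a "model" whose loss is simply the record value:
# memorized members have low loss, nonmembers have high loss.
leaky = loss_threshold_mia(lambda r: r, np.zeros(20), np.ones(20))
safe = loss_threshold_mia(lambda r: r, np.ones(20), np.ones(20))
# leaky == 1.0 (full leakage), safe == 0.5 (no leakage)
```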
The Testbed has developed a set of model PPFL architectures, model datasets, and evaluation methods. We’ve deployed some of these in our PPFL Red Teaming Exercise, which concluded in 2025.
We are currently working with official collaborators to test models, data, and evaluation methods. We anticipate publishing the codebase and datasets in July 2026. Prospective collaborators should email PETs [at] nist.gov.
NIST is conducting a massive data deidentification benchmarking exercise, called the Collaborative Research Cycle (CRC). At its core, the CRC investigates the tradeoffs of competing deidentification and data synthesis techniques.
Tabular demographic data is ubiquitous and highly useful, yet the privacy risks of these data are extensive. Deidentification algorithms take records linked to individuals and attempt to produce data that does not identify individuals but remains useful for analysis. Effective deidentification can allow organizations to share useful information from potentially sensitive data. Such data can be used to train machine learning algorithms; expose fraud, waste, and abuse; improve health outcomes; and other uses. Synthetic data algorithms leverage generative models to reproduce sensitive data distributions using new, synthetic individuals.
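To make the generative idea concrete, here is a deliberately naive synthesizer sketch, a toy of our own rather than any recommended method: it samples each column independently from its empirical distribution, so every output row is a new "synthetic individual."

```python
import numpy as np

def marginal_synthesizer(data, n_rows, rng=None):
    """Deliberately naive synthesizer: resample each column independently
    from its empirical distribution. One-way marginals are preserved, but
    cross-column correlations are destroyed; practical synthesizers model
    the joint distribution instead."""
    rng = rng if rng is not None else np.random.default_rng(0)
    return np.stack([rng.choice(data[:, j], size=n_rows)
                     for j in range(data.shape[1])], axis=1)
```

A synthesizer this simple protects privacy well but fails most utility tests, which is precisely why fidelity and privacy must be measured together.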
While deidentification release mechanisms may improve privacy, they do not necessarily do so, and they can also distort data distributions by introducing artifacts and bias. Identifying and resolving these issues is important, but it is not trivial. The CRC provides target data, metrology, and a venue for comparing the effects of different deidentification and synthetic data methods.
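One simple instance of such metrology, shown here for illustration only and not as SDNist's actual metric suite, is the total variation distance between a column's marginal distribution in the real and deidentified data:

```python
import numpy as np

def marginal_tvd(real_col, deid_col):
    """Total variation distance between one column's empirical distribution
    in the real vs. deidentified data: 0.0 means identical marginals,
    1.0 means completely disjoint support."""
    real_col, deid_col = list(real_col), list(deid_col)
    values = sorted(set(real_col) | set(deid_col))
    p = np.array([real_col.count(v) for v in values]) / len(real_col)
    q = np.array([deid_col.count(v) for v in values]) / len(deid_col)
    return 0.5 * float(np.abs(p - q).sum())
```

Scanning a score like this across columns (or across pairs of columns) is one way distributional artifacts introduced by a deidentification mechanism become visible.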
The Collaborative Research Cycle (CRC) is a massive benchmarking exercise for deidentification and synthetic data algorithms. The CRC asks the community to deidentify target datasets drawn from the US Census Bureau. The resulting database holds over 500 deidentified excerpts produced with techniques including redaction, k-anonymity, differentially private synthetic data, and many other algorithms. Each excerpt is evaluated with SDNist, a Python library for examining deidentified data fidelity, utility, and privacy.
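To give a sense of one technique named above, the helper below, a generic sketch that is not part of SDNist, computes the k of a table over a chosen set of quasi-identifiers (the minimum equivalence-class size):

```python
from collections import Counter

def k_anonymity(records, quasi_ids):
    """Return the k of a table: the size of the smallest group of records
    that agree on every quasi-identifier. A release is k-anonymous when
    each record shares its quasi-identifier values with at least k-1
    other records."""
    groups = Counter(tuple(rec[q] for q in quasi_ids) for rec in records)
    return min(groups.values())

# Toy usage: one record is unique on (age, zip), so k is 1 and that
# record is re-identifiable on those attributes alone.
rows = [
    {"age": 34, "zip": "20899", "dx": "flu"},
    {"age": 34, "zip": "20899", "dx": "cold"},
    {"age": 51, "zip": "20850", "dx": "flu"},
]
```

Generalizing or suppressing quasi-identifier values until k rises is the classic k-anonymization move, and it trades away exactly the fidelity that the CRC's metrics quantify.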
The CRC target data, the database of deidentified excerpts, the SDNist evaluation software, and quick-start notebooks are all in the public domain. We invite the public to use these tools to improve deidentification technologies.
We continue to invite the community to contribute deidentified data excerpts. We welcome collaborations on analysis of our growing database of techniques, on new fidelity, utility, and privacy metrics, and on related research goals. Please reach out to PETs [at] nist.gov for more information.