
Summary

Frequently Asked Questions about the Genome in a Bottle Consortium, NIST's human genome reference materials, and data resources

Description

NIST worked with Coriell to grow large batches of cells, extract the DNA, mix it well, and aliquot it into thousands of vials that became the NIST Reference Materials for HG001-HG005. These were characterized under the NIST quality system and may differ slightly from the DNA distributed by Coriell, which comes from different batches of cells, though in general any differences are expected to be small. The NIST price is higher because it incorporates some of the costs of the NIST quality system and the extensive NIST/GIAB characterization of these samples.

GIAB currently develops new benchmarks first for the PGP Ashkenazi Jewish son HG002 (NIST RM 8391), since it has the most extensive trio data and is covered by the broad consent of the Personal Genome Project (PGP). This currently includes benchmarks that extend our small variant and structural variant calls. Over 50 commercial products based on this cell line are also available. We therefore recommend starting with HG002/RM 8391, though it is often helpful to use all seven of the GIAB genomes.

Yes, our benchmarks can be used to assess targeted exome and gene panel sequencing. You will generally want to subset to your regions of interest, e.g., using the --target-regions option in hap.py. One important limitation is that our benchmarks contain limited numbers of difficult small variants and structural variants in exons, particularly for targeted panels, so it is especially important to calculate confidence intervals for performance metrics like precision and recall. One resource with more challenging variants in clinically important regions spiked into GIAB samples is described in https://doi.org/10.1038/s41436-021-01187-w.
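
Because a small panel may contain only a handful of benchmark variants, point estimates of precision and recall can be misleading on their own. The sketch below is illustrative only; the TP/FP/FN counts are hypothetical (as if read from a benchmarking summary after subsetting to the panel regions) and it computes Wilson score 95% confidence intervals in Python:

    import math

    def wilson_interval(successes, total, z=1.96):
        """95% Wilson score interval for a binomial proportion, e.g., recall = TP / (TP + FN)."""
        if total == 0:
            return (0.0, 1.0)
        p = successes / total
        denom = 1 + z**2 / total
        center = (p + z**2 / (2 * total)) / denom
        half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
        return (center - half, center + half)

    # Hypothetical counts for a small gene panel after subsetting the benchmark
    # to the target regions (e.g., with hap.py --target-regions).
    tp, fn, fp = 480, 20, 5

    recall_lo, recall_hi = wilson_interval(tp, tp + fn)
    precision_lo, precision_hi = wilson_interval(tp, tp + fp)

    print(f"Recall    = {tp / (tp + fn):.3f} (95% CI {recall_lo:.3f}-{recall_hi:.3f})")
    print(f"Precision = {tp / (tp + fp):.3f} (95% CI {precision_lo:.3f}-{precision_hi:.3f})")

With only a few hundred benchmark variants in the targeted regions, the resulting intervals make clear how much uncertainty remains in the reported metrics.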

Yes, we encourage everyone to use these data. All GIAB data are made available with no embargo on publications using the data. We ask that you cite the appropriate reference in the README for each dataset and/or our data publication (https://www.nature.com/articles/sdata201625). When using our benchmarks, please cite our best practices paper on benchmarking (https://www.nature.com/articles/s41587-019-0054-x) and the paper describing the specific benchmark used (listed on the main GIAB website).

GIAB describes its reference genomes in a README under https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/, including new GRCh38 references that mask false duplications discovered by GIAB and the T2T Consortium.

GIAB is making public and analyzing extensive data for a new pancreatic cancer cell line (HG008-T) and matched normal tissues (HG008-N) that are broadly consented for public genomic data sharing. NIST is exploring additional possibilities for developing appropriately consented tumor/normal cell lines for reference material development. NIST has also characterized CNVs in EGFR, HER2, and MET in several tumor cell lines in SRM 2373 and RM 8366. The Medical Device Innovation Consortium has put together a Somatic Reference Sample Landscape Report that describes many of the somatic reference samples available as of early 2019 (https://mdic.org/wp-content/uploads/2019/03/MDIC-SRS-Landscape-Analysis-Report.pdf).

A good first step to learn about active work is to read recent emails in, and sign up for, the general GIAB and analysis team Google Groups:
Analysis Team: https://groups.google.com/forum/#!forum/giab-analysis-team
General GIAB: https://groups.google.com/forum/#!forum/genome-in-a-bottle

We include some variants outside the benchmark bed file because this reduces the risk that our benchmark includes only part of a complex variant (e.g., when one indel is just inside the bed and one is just outside). These complex variants can often be represented in multiple ways in the vcf file, so the benchmark vcf needs to include all parts of a complex variant, even if part falls outside the bed. Otherwise, benchmarking tools could erroneously count different, but correct, representations of the complex variant as incorrect.
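
As an illustration, the following Python sketch (not GIAB tooling; the sequence and coordinates are invented) applies two different VCF-style representations of the same complex variant, one combined record and one decomposed into an SNV plus a deletion, to a toy reference and shows that they produce the identical haplotype. This is why benchmarking tools need the full complex variant in the benchmark vcf to recognize such representations as equivalent:

    def apply_variants(ref, variants):
        """Apply (pos, ref_allele, alt_allele) variants (1-based, non-overlapping) to a reference string."""
        out, last = [], 0
        for pos, ref_allele, alt_allele in sorted(variants):
            assert ref[pos - 1:pos - 1 + len(ref_allele)] == ref_allele  # sanity-check the REF allele
            out.append(ref[last:pos - 1])
            out.append(alt_allele)
            last = pos - 1 + len(ref_allele)
        out.append(ref[last:])
        return "".join(out)

    reference = "ACGTACGT"

    # Representation 1: a single combined record covering the whole complex change.
    combined = [(4, "TAC", "AA")]

    # Representation 2: the same change decomposed into an SNV and a separate deletion.
    decomposed = [(4, "T", "A"), (5, "AC", "A")]

    assert apply_variants(reference, combined) == apply_variants(reference, decomposed) == "ACGAAGT"
    print("Both representations yield the same haplotype:", apply_variants(reference, combined))

If the benchmark contained only one of the two decomposed records, a caller reporting the combined record (or vice versa) could be penalized even though its haplotype is correct.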

In 2018, we decided to change the terminology for our vcf and bed files from "high-confidence" to "benchmark" to more clearly convey their intended use for benchmarking performance. Although we still have high confidence that the variants are largely true, "high-confidence regions" was sometimes interpreted to mean that everyone should have confidence in their own variant calls in these regions. Especially as we expand to more difficult regions, our benchmark regions will contain variants and regions that are difficult to characterize for some methods. In fact, our benchmark variants and regions are intended to enable anyone to determine how well any method performs for different types of variants and genome contexts within our benchmark regions.

We have a Google Form for reporting small variant errors at https://forms.gle/JcYmJSMTdRfXMvcUA and one for structural variant errors at https://forms.gle/hmTHtgyRzHozwT4C6, and you can also email Justin Zook at NIST.

No, these are in fact different trios of Chinese ancestry.  The GIAB Ashkenazi and Chinese trios are from the Personal Genome Project, since they are more broadly consented, including for commercial redistribution, development of iPSCs, etc.  For SVs, we developed the first benchmark that enables both sensitivity and specificity assessment for the son in the Ashkenazi trio (HG002) – see https://doi.org/10.1101/664623.

Currently, the v0.6 SV benchmark is only available on GRCh37, but we do have an SV benchmark for 273 challenging medically relevant genes on both GRCh37 and GRCh38. We are now working on assembly-based SV benchmarks for the whole genome and have made draft HG002 benchmarks available on GRCh37, GRCh38, and T2T-CHM13v2.0.

Created April 4, 2019, Updated December 17, 2024