New GIAB high-confidence calls for HG001/NA12878 and HG002/AJ Son

1,552 views
Skip to first unread message

Justin Zook

unread,
May 27, 2016, 12:32:45 PM5/27/16
to GIAB Analysis Team, Genome in a Bottle

Dear GIAB Participants, 

We have been developing a new, reproducible integration process to create high-confidence SNP, small indel, and homozygous reference calls for benchmarking small variant calls.  We have used this new process to generate v3.2 of our high-confidence calls for HG001 (aka NA12878), as well as the first high-confidence callset for HG002 (aka Ashkenazim Jewish Son), which are under ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/.  These calls were generated only from data generated from our NIST Reference Material batch of cells.  Planned future work will include (1) generating similar callsets on the GIAB/PGP AJ parents and Chinese trio, (2) incorporating additional datasets from these genomes to characterize more difficult regions, (3) incorporating pedigree information for phasing and to improve the calls, (4) developing better methods for homopolymer characterization, (5) making calls on X and Y for males, and (6) characterizing larger indels and structural variation.  We greatly value the help many of you have given already, and we very much welcome additional feedback to help us improve future versions of these calls at http://goo.gl/forms/OCUnvDXMEt1NEX8m2.

We highly recommend reading the information below prior to using these calls to understand how best to use them and their limitations.  There is also a draft google doc that describes the methods used to form these calls and some preliminary comparisons to other callsets at:  https://docs.google.com/document/d/1NE-NGSMslFFndQpbqnvu4wfnwgmgOcQu6KIjSbCauYg/edit?usp=sharing  


Best Practices for Using High-confidence Calls:

Benchmarking variant calls is a complex process, and best practices are still being developed by the Global Alliance for Genomics and Health (GA4GH) Benchmarking Team (https://github.com/ga4gh/benchmarking-tools/).   Several things are important to consider when benchmarking variant call accuracy:

1. Complex variants (e.g., nearby SNPs and indels or block substitutions) can often be represented correctly in multiple ways in the vcf format.  Therefore, we recommend using sophisticated benchmarking tools like those developed by members of the GA4GH benchmarking team.  Currently, the most developed tools are rtgtools vcfeval (http://realtimegenomics.com/products/rtg-tools/) or hap.py (https://github.com/Illumina/hap.py).  See supplementary information in https://docs.google.com/document/d/1NE-NGSMslFFndQpbqnvu4wfnwgmgOcQu6KIjSbCauYg/edit?usp=sharing for a suggested process for doing comparisons.

2. By their nature, high-confidence variant calls and regions tend to include a subset of variants and regions that are easier to characterize.  Accuracy of variant calls outside the high-confidence regions is generally likely to be lower than inside the high-confidence regions, so benchmarking against high-confidence calls will usually overestimate accuracy for all variants in the genome.  Similarly, it is possible that a variant calling method could have higher accuracy statistics compared to other methods when compared to the high-confidence variant calls but have lower accuracy compared to other methods for all variants in the genome.

3. Stratification of performance for different variant types and different genome contexts can be very useful when assessing performance, because performance often differs between variant types and genome contexts.  In addition, stratification can elucidate variant types and genome contexts that fall primarily outside high-confidence regions.  Standardized bed files for stratifying by genome context are available from GA4GH at https://github.com/ga4gh/benchmarking-tools/tree/master/resources/stratification-bed-files.  

4. Particularly for targeted sequencing, it is critical to calculate confidence intervals around statistics like sensitivity because there may be very few examples of variants of some types in the benchmark calls in the targeted regions.

Manual curation of sequence data in a genome browser for a subset of false positives and false negatives is essential for an accurate understanding of statistics like sensitivity and precision.  Curation can often help elucidate whether the benchmark callset is wrong, the test callset is wrong, both callsets are wrong, or the true answer is unclear from current technologies.  If you find questionable or challenging sites, reporting them will help improve future callsets.  You can report information about particular sites at http://goo.gl/forms/OCUnvDXMEt1NEX8m2.


Known issues with high-confidence calls:

There are a number of known issues in the current high-confidence callset, which we plan to address in future releases:

1. Compound heterozygous sites are sometimes represented as a heterozygous deletion that overlaps with a heterozygous SNP.  At these sites, the genotype “0/1” or “0|1” should not be interpreted as meaning that the “0” haplotype is reference.  Hap.py and vcfeval (with the --ref-overlap flag) interpret these sites as intended.

2. The callset currently only includes local phasing information from freebayes, which does not use the PS field, so that locally phased variants should not be assumed to be phased with variants that are phased at a distant location.

3. The high-confidence calls have an increased FP, FN, and genotyping error rate for single base indels in homopolymers.  Some of these sites have a low allele fraction variant in multiple technologies so that it is unclear whether the site is a systematic sequencing error or a true variant in a fraction of the cells.  Some of these sites are listed in https://docs.google.com/spreadsheets/d/1kHgRLinYcnxX3-ulvijf2HrIdrQWz5R5PtxZS-_s6ZM/edit?usp=sharing  


Differences between v2.19 and v3.2 integration methods:

The new v3.2 integration methods differ from the previous GIAB calls (v2.18 and v2.19) in several ways, both in the data used and the integration process and heuristics:

1. Only 4 datasets were used, selected because they represent 4 sequencing technologies that were generated from the NIST RM 8398 batch of DNA.  They are: 300x Illumina paired end WGS, 100x Complete Genomics WGS, 1000x Ion exome sequencing, and SOLID WGS.

2. Mapping and variant calling algorithms designed specifically for each technology were used to generate sensitive variant callsets where possible: novoalign + GATK-haplotypecaller and freebayes for Illumina, the vcfBeta file from the standard Complete Genomics pipeline, tmap+TVC for Ion exome, and Lifescope+GATK-HC for SOLID.  This is intended to minimize bias towards any particular bioinformatics toolchain.

3. Instead of forcing GATK to call genotypes at candidate variants in the bam files from each technology, we generate sensitive variant call sets and a bed file that describes the regions that were callable from each dataset.  For Illumina, we used GATK callable loci to find regions with at least 20 reads with MQ >= 20 and with coverage less than 2x the median.  For Complete Genomics, we used the callable regions defined by the vcfBeta file and excluded +-50bp around any no-called or half-called variant.  For Ion, we intersected the exome targeted regions with callableloci (requiring at least 20 reads with MQ >= 20).  Due to the shorter reads and low coverage for SOLID, it was only used to confirm variants, so no regions were considered callable.

4. A new file with putative structural variants was used to exclude potential errors around SVs.  These were SVs derived from multiple PacBio callers and multiple integrated illumina callers using MetaSV.  These make up a significantly smaller fraction of the calls and genome (~4.5%) than the previous bed, which was a union of all dbVar calls for NA12878 (~10%).

5. Homopolymers >10bp in length, including those interrupted by one nucleotide different from the homopolymer, are now excluded from the high-confidence regions, because these were seen to have a high error rate for all technologies.  This constitutes a high fraction of the putative indel calls, so more work is needed to resolve these.

6. A bug that caused nearby variants to be missed in v2.19 is fixed in the new calls.

7. The new vcf contains variants outside the high-confidence bed file.  This enables more robust comparison of complex variants or nearby variants that are near the boundary of the bed file.  It also allows the user to evaluate concordance outside the high-confidence regions, but these concordance metrics should be interpreted with great care.

8. We now output local phasing information when it is provided in the Illumina freebayes calls.


Cheers,

Justin

Justin Zook

unread,
Jun 10, 2016, 11:20:18 AM6/10/16
to Genome in a Bottle, GIAB Analysis Team, ga4gh-dwg-b...@genomicsandhealth.org
Dear all,

We wanted to update you that we have released a new version (v3.2.2) of calls for HG001 and HG002 based on feedback received about v3.2: 

Changes in v3.2.2:
We found a bug in one of the input call sets that was causing many of the variants on chr7 in HG001 and HG002 and on chr9 on HG002 to be reported as not high confidence.  We have corrected this issue, and all other chromosomes are the same as in v3.2.1.

Changes in v3.2.1:
The only change in v3.2.1 is to exclude regions from our high confidence bed file if they have homology to the decoy sequence hs37d5.  These decoy-related regions were generated by Heng Li and are available as mm-2-merged.bed at https://github.com/ga4gh/benchmarking-tools/tree/master/resources/stratification-bed-files/SegmentalDuplications. Our high-confidence vcf generally does not include variants if they are homologous to decoy sequence. Although most of these likely should be excluded, some real variants may be missed, so we are excluding these regions from our high-confidence regions until they are better characterized. However, it is important to note that these regions may be enriched for false positives if the decoy is not used in mapping.

Additional information about v3.2 is in the previous email below.

Cheers,
Justin

> --
> You received this message because you are subscribed to the Google Groups "GA4GH DWG Benchmarking Task Team" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to
> ga4gh-dwg-benchma...@genomicsandhealth.org.
>
>
Reply all
Reply to author
Forward
0 new messages