New GIAB High-confidence calls and draft manuscript

Justin Zook
Mar 3, 2017, 4:16:48 PM
to Genome in a Bottle, GIAB Analysis Team
Dear GIAB Participants,

We are pleased to announce a new version of high-confidence small variant and reference calls, v3.3.2, on both GRCh37 and GRCh38 for all 5 GIAB genomes that are NIST Reference Materials (available under ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/).  These calls include a variety of improvements, listed in detail below, including improved accuracy and family-based paternal|maternal phasing for HG001 and HG002.  We have also made available a draft manuscript describing the methods used to form these calls, assessments of their quality, and example comparisons: https://docs.google.com/document/d/13XTVwrlAgugYEYtL3W64KX2hy-7xzWNTj8qPN8rULVU/edit?usp=sharing
Suggestions for improving this draft manuscript are very welcome.  We plan to submit a preprint of this manuscript soon.  Until it is available, please cite http://www.nature.com/nbt/journal/v32/n3/full/nbt.2835.html (doi:10.1038/nbt.2835) and http://www.nature.com/articles/sdata201625 (doi:10.1038/sdata.2016.25) when using these calls.

We strongly recommend reading the README accompanying the high-confidence calls, or the information below, prior to using these calls.  We encourage everyone to report information about particular questionable sites at http://goo.gl/forms/OCUnvDXMEt1NEX8m2.

Cheers,
Justin

Important usage notes:
The vcf and bed files are intended to be used in conjunction to benchmark accuracy of small variant calls.  We strongly recommend reading the information below prior to using these calls to understand how best to use them and their limitations.  
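
As a rough illustration of using the vcf and bed files together, the Python sketch below keeps only variant positions that fall inside the high-confidence bed regions.  The function names load_bed and in_regions are hypothetical, and the sketch only checks the start position of each record (it ignores how far an indel extends), so it is not a substitute for a real benchmarking tool.

```python
import bisect
from collections import defaultdict

def load_bed(lines):
    """Parse BED lines into sorted, merged (start, end) lists per chromosome.

    BED coordinates are 0-based and half-open.
    """
    raw = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        raw[chrom].append((int(start), int(end)))
    merged = {}
    for chrom, intervals in raw.items():
        intervals.sort()
        out = [intervals[0]]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:
                # Overlapping or adjacent interval: extend the previous one.
                out[-1] = (out[-1][0], max(out[-1][1], end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged

def in_regions(regions, chrom, pos):
    """Return True if the 1-based VCF position lies inside a BED interval."""
    intervals = regions.get(chrom, [])
    zero_based = pos - 1
    # Index of the last interval whose start is <= zero_based.
    i = bisect.bisect_right(intervals, (zero_based, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= zero_based < intervals[i][1]
```

Note the coordinate shift: VCF positions are 1-based while BED starts are 0-based, which is a common source of off-by-one errors when combining the two files by hand.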

Best Practices for Using High-confidence Calls:
Benchmarking variant calls is a complex process, and best practices are still being developed by the Global Alliance for Genomics and Health (GA4GH) Benchmarking Team (https://github.com/ga4gh/benchmarking-tools/).  Several things are important to consider when benchmarking variant call accuracy:
1. Complex variants (e.g., nearby SNPs and indels or block substitutions) can often be represented correctly in multiple ways in the vcf format.  Therefore, we recommend using sophisticated benchmarking tools like those developed by members of the GA4GH benchmarking team.  The latest version of hap.py (https://github.com/Illumina/hap.py) now allows the user to choose between hap.py’s xcmp and RTG’s vcfeval comparison tools, both of which perform sophisticated variant comparisons.  Preliminary tests indicate they perform very similarly, but vcfeval matches some additional variants where only part of a complex variant is called.
2. By their nature, high-confidence variant calls and regions tend to include a subset of variants and regions that are easier to detect.  Accuracy of variant calls outside the high-confidence regions is likely to be lower than inside them, so benchmarking against high-confidence calls will usually overestimate accuracy for all variants in the genome.  Similarly, a variant calling method may have higher accuracy statistics than other methods when compared to the high-confidence variant calls, yet lower accuracy than those methods across all variants in the genome.
3. Stratification of performance for different variant types and different genome contexts can be very useful when assessing performance, because performance often differs between variant types and genome contexts.  In addition, stratification can elucidate variant types and genome contexts that fall primarily outside high-confidence regions.  Standardized bed files for stratifying by genome context are available from GA4GH at https://github.com/ga4gh/benchmarking-tools/tree/master/resources/stratification-bed-files, and these can be added directly into the hap.py comparison framework. 
4. Particularly for targeted sequencing, it is critical to calculate confidence intervals around statistics like sensitivity because there may be very few examples of variants of some types in the benchmark calls in the targeted regions. 
5. Manual curation of sequence data in a genome browser for a subset of false positives and false negatives is essential for an accurate understanding of statistics like sensitivity and precision.  Curation can often help elucidate whether the benchmark callset is wrong, the test callset is wrong, both callsets are wrong, or the true answer is unclear from current technologies.  If you find questionable or challenging sites, reporting them will help improve future callsets.  We encourage anyone to report information about particular sites at http://goo.gl/forms/OCUnvDXMEt1NEX8m2
6. There is a new app on precisionFDA (Vcfeval + Hap.py Comparison Custom Stratification) that performs comparisons in an online interface, and by default it uses a large set of stratification bed files to assess performance in different types of difficult regions as well as different types of complex variants in each genome.
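
To illustrate the confidence intervals recommended in item 4 above, here is a minimal Python sketch that computes a Wilson score interval for sensitivity from true positive and false negative counts.  The Wilson interval is one reasonable choice and is our assumption here, not a method prescribed by GIAB or GA4GH; the function names are hypothetical.

```python
import math

def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)

def sensitivity_ci(true_positives, false_negatives, z=1.96):
    """Sensitivity = TP / (TP + FN), returned with its Wilson interval."""
    total = true_positives + false_negatives
    low, high = wilson_interval(true_positives, total, z)
    return true_positives / total, low, high
```

With only 10 benchmark variants in a targeted region, all of them detected, the point estimate is 100% but the lower bound of the interval is only about 72%, which shows why small counts need an interval rather than a single number.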


Changes in v3.3.2:
1. Fix bug in callable regions script for GATKHC gvcf, which was erroneously determining too many regions to be callable; add "difficult region" bed file annotation to high confidence vcf
2. Filter sites that are within 50bp of another passing call when none of the callsets that support the 2 calls match.  Previously, some nearby conflicting calls from different callers were both considered high confidence if another callset from the same dataset was filtered.  This eliminates many problematic cases, but a small number of conflicting calls remain.
3. Subtract SV regions from HG005 bed when called by MetaSV in any member of the Chinese trio (Thanks to Roche/Bina for running MetaSV)
4. We now have GRCh38 calls for all individuals.
5. For new GRCh38 analyses of Illumina and 10X data in the AJ trio and Chinese son, use the Sentieon haplotyper in place of GATK-HC, since it gives essentially identical results and runs faster. (Thanks to Rafael Saldana at Sentieon for help with this)
6. Use RTG vcfeval tools to harmonize representation of complex variants in AJ trio prior to performing Mendelian inheritance analysis and phasing. Apply trio phasing to HG002 high confidence vcf, and exclude 50bp from high-confidence beds for all AJ trio individuals on either side of Mendelian inheritance errors that are not de novo mutations.
7. For the phased vcfs for HG001 and HG002, include phasing information both from family-based phasing and from local read-based phasing, prioritizing family-based phasing.  For family-based phased calls, the PS field contains PATMAT.  For local read-based phased calls, the PS field contains the PID from GATKHC.  For homozygous calls that were not otherwise phased, we changed their status to phased and put HOMVAR in PS.  Pedigree and trio phased calls have alleles in the order paternal|maternal. For phasing comparisons that require paternal|maternal phasing, only calls with PATMAT in PS should be included.  For HG001, 99.0% of high-confidence calls are phased by the Platinum Genome or RTG pedigree analyses, and 99.5% are phased by the pedigree, GATKHC, or are homozygous variant.  For HG002, 87.0% of high-confidence calls are phased by the trio, and 89.5% of calls are phased by the trio, GATKHC, or are homozygous variant. (Thanks to Sean Irvine and Len Trigg at RTG for help with these analyses)
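
The PS conventions in item 7 can be applied with a short Python sketch like the one below, which classifies each call by its phasing source and returns paternal and maternal alleles only for PATMAT calls.  It assumes a single-sample vcf in which GT and PS appear as FORMAT fields and PS holds exactly PATMAT or HOMVAR (anything else is treated as a read-based phase set); please check the vcf header and README before relying on this.  The function names are hypothetical.

```python
def _sample_fields(vcf_line):
    """Map FORMAT keys to the first sample's values for one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return {}
    return dict(zip(fields[8].split(":"), fields[9].split(":")))

def phasing_source(vcf_line):
    """Return 'family', 'homvar', 'read', or None if the call is unphased."""
    sample = _sample_fields(vcf_line)
    gt = sample.get("GT", "")
    ps = sample.get("PS", ".")
    if "|" not in gt or ps in ("", "."):
        return None
    if ps == "PATMAT":
        return "family"
    if ps == "HOMVAR":
        return "homvar"
    return "read"  # assumed to be a local read-based phase set ID

def paternal_maternal_alleles(vcf_line):
    """Return (paternal, maternal) allele indices for PATMAT calls, else None."""
    if phasing_source(vcf_line) != "family":
        return None
    paternal, maternal = _sample_fields(vcf_line)["GT"].split("|")
    return int(paternal), int(maternal)
```

For phasing comparisons that require paternal|maternal order, only calls for which paternal_maternal_alleles returns a value would be kept.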