New GIAB high-confidence calls vs. GRCh38 (and GRCh37)

Justin Zook

Oct 17, 2016, 6:55:51 PM
to Genome in a Bottle, GIAB Analysis Team, GA4GH DWG Benchmarking Task Team
Dear Colleagues,

We have a new version (v3.3.1) of high-confidence SNP, small indel, and homozygous reference calls for the Genome in a Bottle (GIAB) sample HG001 (aka NA12878) in comparison to both GRCh37 and GRCh38. These calls are available at ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3.1/. These are our first high-confidence calls for GRCh38, so we greatly value your feedback about the utility of these calls and how we might improve them. In addition to making calls on GRCh38, we now have >40% more indel calls than v3.3, and >98% of calls are phased. We have outlined some statistics and additional differences between these calls and previous versions below and in the README.

The vcf and bed files are intended to be used in conjunction to benchmark accuracy of small variant calls. We strongly recommend reading the information in the README prior to using these calls to understand how best to use them and their limitations. As always, please also let us know if you find any potential issues with the calls.

In the next 1-2 weeks, we will post GRCh37 v3.3.1 calls for HG002/HG003/HG004 (aka AJ son/father/mother) and HG005 (aka Chinese son) as well.

If you find questionable or challenging sites, reporting them will help improve future callsets. We encourage anyone to report information about particular sites at http://goo.gl/forms/OCUnvDXMEt1NEX8m2.

We also have a comparison of this callset and previous versions of our callsets to Platinum Genomes here: https://docs.google.com/spreadsheets/d/12EEwXH1iYFtZLQ3rqLMZYd_Vjnm-UOizQRwDYvjRVio/edit?usp=sharing
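
As noted above, the vcf and bed files are meant to be used together: only calls that fall inside the high-confidence bed regions should count toward a benchmark. The Python sketch below illustrates that idea on toy data only. The function names, the toy intervals, and the toy calls are hypothetical and not part of the release, and it assumes non-overlapping bed intervals. It handles single positions only, so it is not a substitute for a dedicated benchmarking tool.

```python
from bisect import bisect_right

def load_regions(bed_lines):
    """Parse BED lines into {chrom: sorted [(start, end)]}. BED is 0-based, half-open."""
    regions = {}
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    for intervals in regions.values():
        intervals.sort()
    return regions

def in_regions(regions, chrom, pos):
    """True if the 1-based VCF position `pos` lies inside a region on `chrom`."""
    intervals = regions.get(chrom, [])
    # Index of the last interval whose start is <= the 0-based position.
    i = bisect_right(intervals, (pos - 1, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos - 1 < intervals[i][1]

# Hypothetical toy data: two high-confidence intervals and five call positions.
TOY_BED = ["chr1\t100\t200", "chr1\t300\t400"]
TOY_CALLS = [("chr1", 100), ("chr1", 101), ("chr1", 200), ("chr1", 201), ("chr2", 150)]

regions = load_regions(TOY_BED)
kept = [(c, p) for c, p in TOY_CALLS if in_regions(regions, c, p)]
print(kept)
```

The off-by-one handling matters here: bed coordinates are 0-based and half-open while vcf positions are 1-based, so the bed interval 100-200 covers vcf positions 101 through 200.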

Changes in v3.3.1: 
1. Because freebayes sometimes misses calls in repetitive regions, now exclude tandem repeats of any size from freebayes callable regions. Also, exclude these regions from GATK-HC calls for Mate-Pair data since amplification causes a higher error rate. 
2. Change GATK-HC gvcf parsing to ignore reference bases with low GQ within 10bp of an indel, since these often caused us to exclude good indels.  This significantly increases the number of indels to 505169 in v3.3.1 vs. 358753 in v3.3.
3. Now make calls for GRCh38 in addition to GRCh37 (initially only for HG001). For Illumina and 10X data, variant calls were made similarly to GRCh37 but from reads mapped to GRCh38 with decoy but no alts (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz).  For Complete Genomics, Ion exome, and SOLiD data, vcf and callable bed files were converted from GRCh37 to GRCh38 by Cory McLean from Verily using the tool GenomeWarp (https://github.com/verilylifesciences/genomewarp).  This tool converts vcf and callable bed files in a conservative and sophisticated manner, accounting for base changes that were made between the two references.  Modeled centromere and heterochromatin regions are explicitly excluded from the high-confidence bed. 
4. Add phasing information from Real Time Genomics and Illumina Platinum Genomes phased pedigree calls so that now 98-99% of calls are phased. Phasing process applied to v3.3.1 for HG001 by RTG (credits: Sean Irvine and Len Trigg): 
Using the fully-phased RTG SegregationPhasing v.37.3.3 and Illumina Platinum Genomes 2016-1.0, RTG used vcfeval to transfer phasing information to GIAB v3.3.1 for NA12878.  The resulting call sets for GRCh37 and GRCh38 are 98.6% and 98.0% phased, respectively.  The phase transfer is performed in such a way that the original calls and genotypes of GIAB v3.3.1 are not changed (other than phasing).  In both cases, the existing local phasing of GIAB v3.3.1 was dropped before doing the transfer.  
Details for fully-phased sets:  
RTG SegregationPhasing (SP): 100.0% (4222244/4224035) 
Cleaned RTG SegregationPhasing: 100.0% (4222244/4224024) 
Platinum Genomes (PG): 100.0% (4049512/4049512) 
Cleaned Platinum Genomes: 100.0% (4049512/4049512)  
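
The ">40% more indel calls" figure follows directly from the two indel counts quoted in change 2 above. A quick check in Python, using only those two numbers (the variable names are illustrative):

```python
# Indel counts quoted in change 2.
indels_v3_3 = 358753
indels_v3_3_1 = 505169

# Relative increase of v3.3.1 over v3.3, in percent.
increase_pct = 100 * (indels_v3_3_1 - indels_v3_3) / indels_v3_3
print(f"{increase_pct:.1f}% more indels in v3.3.1 than in v3.3")
```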

Details for GRCh37:  
Original calls phased: 10.1% (388646/3843181) 
Original calls phasing removed: 0% (0/3843181) 
SP Transferred: 98.1% (3770070/3843181) 
SP and PG Transferred: 98.6% (3788674/3843181)  
All variants: 3843181 
SNPs: 3271601 
Insertions: 268387 
Deletions: 287319 
Indels: 15874 
Phased Genotypes: 98.6% (3788674/3843181) 
SNP Transitions/Transversions: 2.09 (3090872/1477889) 
Total Het/Hom ratio: 1.53 (2323803/1519378) 
SNP Het/Hom ratio: 1.52 (1975346/1296255) 
Insertion Het/Hom ratio: 1.40 (156335/112052) 
Deletion Het/Hom ratio: 1.59 (176522/110797) 
Indel Het/Hom ratio: 56.93 (15600/274) 
Insertion/Deletion ratio: 0.93 (268387/287319) 
Indel/SNP+MNP ratio: 0.17 (571580/3271601)   

Details for GRCh38:  
Original calls phased: 10.2% (377091/3709412) 
Original calls phasing removed: 0% (0/3709412) 
SP Transferred: 97.3% (3607917/3709412) 
SP and PG Transferred: 98.0% (3634961/3709412)  
All variants: 3709412 
SNPs: 3102724 
Insertions: 293598 
Deletions: 296423 
Indels: 16667 
Phased Genotypes: 98.0% (3634961/3709412) 
SNP Transitions/Transversions: 2.09 (2925118/1398168) 
Total Het/Hom ratio: 1.59 (2277006/1432406) 
SNP Het/Hom ratio: 1.54 (1883057/1219667) 
Insertion Het/Hom ratio: 1.74 (186337/107261) 
Deletion Het/Hom ratio: 1.82 (191242/105181) 
Indel Het/Hom ratio: 55.12 (16370/297) 
Insertion/Deletion ratio: 0.99 (293598/296423) 
Indel/SNP+MNP ratio: 0.20 (606688/3102724)
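
The derived figures in the two listings above can be recomputed from the raw counts. The sketch below copies the counts from those listings; the dictionary layout and function name are illustrative. It shows that the Indel/SNP+MNP numerator equals insertions plus deletions plus the separate "Indels" category, and that the four categories sum to the "All variants" total.

```python
# Counts copied from the GRCh37 and GRCh38 listings above.
STATS = {
    "GRCh37": {"snps": 3271601, "insertions": 268387, "deletions": 287319,
               "indels": 15874, "phased": 3788674, "total": 3843181},
    "GRCh38": {"snps": 3102724, "insertions": 293598, "deletions": 296423,
               "indels": 16667, "phased": 3634961, "total": 3709412},
}

def summarize(c):
    """Recompute the phased percentage and two ratios from raw counts."""
    all_indels = c["insertions"] + c["deletions"] + c["indels"]
    # The four variant categories should account for every call.
    assert c["snps"] + all_indels == c["total"]
    return {
        "all_indels": all_indels,
        "phased_pct": round(100 * c["phased"] / c["total"], 1),
        "ins_del_ratio": round(c["insertions"] / c["deletions"], 2),
        "indel_snp_ratio": round(all_indels / c["snps"], 2),
    }

SUMMARY = {build: summarize(c) for build, c in STATS.items()}
for build, s in SUMMARY.items():
    print(build, s)
```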

Cheers,
Justin