Dear Colleagues,
We have a new version (v3.3.1) of high-confidence SNP, small indel, and homozygous reference calls for the Genome in a Bottle (GIAB) sample HG001 (aka NA12878) relative to both GRCh37 and GRCh38. These calls are available at
ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3.1/
These are our first high-confidence calls for GRCh38, so we greatly value your feedback about the utility of these calls and how we might improve them. In addition to making calls on GRCh38, we now have >40% more indel calls than v3.3, and >98% of calls are phased. We have outlined some statistics and additional differences between these calls and previous versions below and in the README.
The vcf and bed files are intended to be used together to benchmark the accuracy of small variant calls. We strongly recommend reading the README before using these calls to understand how best to use them and what their limitations are. As always, please let us know if you find any potential issues with the calls. In the next 1-2 weeks, we will also post GRCh37 v3.3.1 calls for HG002/HG003/HG004 (aka AJ son/father/mother) and HG005 (aka Chinese son).
Reports of questionable or challenging sites will help improve future callsets. We encourage anyone to report information about particular sites at
http://goo.gl/forms/OCUnvDXMEt1NEX8m2
We also have a comparison of this callset and previous versions of our callsets to Platinum Genomes here:
https://docs.google.com/spreadsheets/d/12EEwXH1iYFtZLQ3rqLMZYd_Vjnm-UOizQRwDYvjRVio/edit?usp=sharing
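As a rough illustration of how the vcf and bed files work together, the sketch below keeps only the calls whose position falls inside the high-confidence regions. The function names and toy coordinates are made up for illustration, and the check looks only at POS (not the full REF span of a deletion). A real comparison should use a dedicated benchmarking tool, which also reconciles different representations of the same variant.

```python
from bisect import bisect_right

def load_bed(lines):
    """Parse BED lines into {chrom: sorted [(start, end)]}; BED is 0-based, half-open."""
    regions = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    for intervals in regions.values():
        intervals.sort()
    return regions

def in_regions(regions, chrom, pos):
    """True if 1-based VCF position `pos` is inside an interval.

    Assumes the intervals on each chromosome do not overlap.
    """
    intervals = regions.get(chrom, [])
    zero_based = pos - 1
    # Index of the last interval whose start is <= the position.
    i = bisect_right(intervals, (zero_based, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= zero_based < intervals[i][1]

def filter_vcf(vcf_lines, regions):
    """Yield header lines unchanged and only those records inside the regions."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        chrom, pos = line.split("\t")[:2]
        if in_regions(regions, chrom, int(pos)):
            yield line

if __name__ == "__main__":
    # Toy data, not taken from the actual release files.
    bed = ["1\t100\t200", "1\t300\t400"]
    vcf = [
        "##fileformat=VCFv4.2",
        "1\t150\t.\tA\tG\t.\tPASS\t.",
        "1\t250\t.\tC\tT\t.\tPASS\t.",
    ]
    for record in filter_vcf(vcf, load_bed(bed)):
        print(record)
```

Calls outside the bed regions are neither confirmed nor refuted by the benchmark, so they are dropped rather than counted as errors.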
Changes in v3.3.1:
1. Because freebayes sometimes misses calls in repetitive regions, now exclude tandem repeats of any size from freebayes callable regions. Also, exclude these regions from GATK-HC calls for Mate-Pair data since amplification causes a higher error rate.
2. Change GATK-HC gvcf parsing to ignore reference bases with low GQ within 10bp of an indel, since these often caused us to exclude good indels. This increases the number of indels from 358753 in v3.3 to 505169 in v3.3.1.
4. Add phasing information from Real Time Genomics and Illumina Platinum Genomes phased pedigree calls so that now 98-99% of calls are phased. Phasing process applied to v3.3.1 for HG001 by RTG (credits: Sean Irvine and Len Trigg):
Using the fully-phased RTG SegregationPhasing v.37.3.3 and Illumina Platinum Genomes 2016-1.0, RTG used vcfeval to transfer phasing information to GIAB v3.3.1 for NA12878. The resulting call sets for GRCh37 and GRCh38 are 98.6% and 98.0% phased, respectively. The phase transfer is performed in such a way that the original calls and genotypes of GIAB v3.3.1 are not changed (other than phasing). In both cases, the existing local phasing of GIAB v3.3.1 was dropped before doing the transfer.
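To make the idea of transferring phase without changing calls concrete, here is a minimal sketch that works on GT strings only. It is not how vcfeval works internally (vcfeval matches variants at the haplotype level). The function name `transfer_phase` is hypothetical, and the sketch assumes the target and donor records sit at the same site with the same ALT allele numbering.

```python
def transfer_phase(target_gt, donor_gt):
    """Return the target genotype carrying the donor's phase when both
    genotypes contain the same alleles; otherwise return it unphased.

    Hypothetical helper for illustration; assumes identical ALT numbering.
    """
    # Drop any existing local phasing on the target before the transfer.
    target_alleles = target_gt.replace("|", "/").split("/")
    unphased = "/".join(target_alleles)

    # Nothing to transfer if the donor is missing or itself unphased.
    if donor_gt is None or "|" not in donor_gt:
        return unphased

    donor_alleles = donor_gt.split("|")
    # Only copy the phase if the genotype itself would not change.
    if sorted(donor_alleles) != sorted(target_alleles):
        return unphased
    return "|".join(donor_alleles)

if __name__ == "__main__":
    print(transfer_phase("0/1", "1|0"))   # same alleles: phase is copied
    print(transfer_phase("0|1", None))    # no donor: local phasing dropped
    print(transfer_phase("0/1", "1|1"))   # genotypes differ: left unphased
```

Sites where the donor genotype disagrees, or where no donor call exists, stay unphased, which is one way a transfer like this can end up below 100% phased.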
Details for fully-phased sets:
RTG SegregationPhasing (SP): 100.0% (4222244/4224035)
Cleaned RTG SegregationPhasing: 100.0% (4222244/4224024)
Platinum Genomes (PG): 100.0% (4049512/4049512)
Cleaned Platinum Genomes: 100.0% (4049512/4049512)
Details for GRCh37:
Original calls phased: 10.1% (388646/3843181)
Original calls phasing removed: 0% (0/3843181)
SP Transferred: 98.1% (3770070/3843181)
SP and PG Transferred: 98.6% (3788674/3843181)
All variants: 3843181
SNPs: 3271601
Insertions: 268387
Deletions: 287319
Indels: 15874
Phased Genotypes: 98.6% (3788674/3843181)
SNP Transitions/Transversions: 2.09 (3090872/1477889)
Total Het/Hom ratio: 1.53 (2323803/1519378)
SNP Het/Hom ratio: 1.52 (1975346/1296255)
Insertion Het/Hom ratio: 1.40 (156335/112052)
Deletion Het/Hom ratio: 1.59 (176522/110797)
Indel Het/Hom ratio: 56.93 (15600/274)
Insertion/Deletion ratio: 0.93 (268387/287319)
Indel/SNP+MNP ratio: 0.17 (571580/3271601)
Details for GRCh38:
Original calls phased: 10.2% (377091/3709412)
Original calls phasing removed: 0% (0/3709412)
SP Transferred: 97.3% (3607917/3709412)
SP and PG Transferred: 98.0% (3634961/3709412)
All variants: 3709412
SNPs: 3102724
Insertions: 293598
Deletions: 296423
Indels: 16667
Phased Genotypes: 98.0% (3634961/3709412)
SNP Transitions/Transversions: 2.09 (2925118/1398168)
Total Het/Hom ratio: 1.59 (2277006/1432406)
SNP Het/Hom ratio: 1.54 (1883057/1219667)
Insertion Het/Hom ratio: 1.74 (186337/107261)
Deletion Het/Hom ratio: 1.82 (191242/105181)
Indel Het/Hom ratio: 55.12 (16370/297)
Insertion/Deletion ratio: 0.99 (293598/296423)
Indel/SNP+MNP ratio: 0.20 (606688/3102724)
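For anyone who wants to sanity-check the summary numbers above, the short sketch below recomputes a few of them from the raw counts listed in this message. The dictionary layout and function names are only for illustration. The GRCh38 SNP count is taken from the denominator of the Indel/SNP+MNP ratio.

```python
def percent(numerator, denominator):
    """Percentage rounded to one decimal place, as reported above."""
    return round(100.0 * numerator / denominator, 1)

def ratio(numerator, denominator):
    """Ratio rounded to two decimal places, as reported above."""
    return round(numerator / denominator, 2)

# Counts copied from the statistics in this message.
GRCH37 = {"all": 3843181, "snps": 3271601, "insertions": 268387,
          "deletions": 287319, "indels": 15874, "phased": 3788674}
GRCH38 = {"all": 3709412, "snps": 3102724, "insertions": 293598,
          "deletions": 296423, "indels": 16667, "phased": 3634961}

def summarize(counts):
    """Recompute the phased percentage and two of the ratios for one reference."""
    non_snp = counts["insertions"] + counts["deletions"] + counts["indels"]
    # The four variant classes should add up to the total.
    assert counts["snps"] + non_snp == counts["all"]
    return {
        "phased_pct": percent(counts["phased"], counts["all"]),
        "ins_del_ratio": ratio(counts["insertions"], counts["deletions"]),
        "indel_snp_ratio": ratio(non_snp, counts["snps"]),
    }

if __name__ == "__main__":
    for name, counts in (("GRCh37", GRCH37), ("GRCh38", GRCH38)):
        print(name, summarize(counts))
    # Relative increase in indels from v3.3 (358753) to v3.3.1 (505169).
    print("indel increase %:", percent(505169 - 358753, 358753))
```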
Cheers,
Justin