Dear GIAB Analysis Team,
As many of you know, we have been working on improving our January v0.5.0 draft benchmark large indel and SV callset. We now have a new v0.6 draft benchmark SV callset, which has 2 tiers of calls >=50bp for HG002 and is accompanied by high-confidence regions to enable both FP and FN assessment. The Tier 1 calls and regions are most straightforward to use for benchmarking, since they are isolated, sequence-resolved calls. The Tier 2 regions could help enable performance assessment for more challenging variants, since we have good evidence for at least one SV in these regions, but the region is complex or not confidently sequence-resolved.
Before our next GIAB Analysis Team call on June 25, we are asking for some of you to do some initial evaluations of these calls and regions. After receiving this initial feedback to ensure there aren’t any problems to fix or easy ways to improve the calls, we will ask for volunteers to perform more in depth evaluations that would be used to write a manuscript about our benchmark set. The README pasted below contains additional information about the process used to form these calls and how to use the benchmarking tool truvari to compare to the Tier 1 calls and regions. Please email me or the team if you have any feedback or questions before the call on June 25.
The v0.6 calls and regions for HG002 are at:
ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/
Cheers,
Justin
README:
This folder contains v0.6 "straw man" of sequence-resolved structural variant calls >=50bp for the GIAB Ashkenazi son. We now separate the calls into 2 tiers: (1) isolated, sequence-resolved SVs and (2) regions with at least one likely SV but it is complex or we were unable to determine a consensus sequence change. We also are disseminating our first high-confidence SV bed file, defining regions in which we believe our Tier 1 callset should be comprehensive, so that any extra variants detected by a method should be false positives. We recommend using these calls critically, since they may contain false positives, false negatives, and inaccurate breakpoints and sequence predictions. We are requesting feedback about the utility, accuracy, and comprehensiveness of these calls in order to work towards increasingly reliable benchmark callsets.
A new SV benchmarking tool truvari can be used to compare a vcf to our Tier 1 vcf and bed files at specified matching stringency. An example command is:
truvari.py -b HG002_SVs_Tier1_v0.6.vcf.gz -c your.vcf.gz -o yourvcfvsGIABv0.6 --passonly --includebed HG002_SVs_Tier1_v0.6.bed -r 2000 --giabreport
As described in the truvari documentation, -p 0 can be used with truvari to match only size and type and ignore sequence differences, which can particularly be useful for callsets that are not sequence-resolved.
These calls were created using the following process, which is identical to v0.5.0 in steps 1-5:
1. A union set of >1 million sequence-resolved calls >=20bp using 30+ callers from 5 short, long, and linked read datasets was collected from the GIAB community (union_171212_refalt.sort.vcf.gz).
2. These calls were compared to calls within 100kbp using svanalyzer SVcomp (https://github.com/nhansen/SVanalyzer/blob/master/scripts/), which determines how similar the sequence changes predicted by 2 calls are. Calls with all distance metrics <20% of the size were clustered, as described here (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_UnionSVs_12122017/SVmerge121217/) to produce union_171212_refalt.2.2.2.clustered.vcf.gz
3. union_171212_refalt.2.2.2.clustered.vcf.gz was subset to 74012 calls discovered by 2+ technologies or 4+ callsets across the trio, or calls with a size <500bp different from a BioNano insertion or deletion call in the same region, or calls with a Nabsys svm_score > 0.9. (Note that v0.4.0 was similar except it excluded calls with support from only one technology and <5 callsets in all individuals).
4. svviz2 (genosv) was used to find how many reads supported the reference and alternate alleles for each of the 74012 calls from #3, using 4 datasets from HG002, HG003, and HG004: 250bp Illumina, haplotype-separated 10X Genomics, Illumina 6kb Mate-pair, PacBio, and PacBio haplotype-partitioned by 10x phasing using whatshap.
5. Genotypes for HG002, HG003, and HG004 were determine for each dataset based on cut-offs for weighted ALT/(REF+ALT) counts determined manually from looking at distributions for different size ranges. In particular, for PacBio, we required ALT+REF>=8 and for GT=0/0 ALT/(REF+ALT)<0.1, for GT=0/1 0.25<ALT/(REF+ALT)<0.75, and for GT=1/1 ALT/(REF+ALT)>0.9. For 250bp and MP Illumina, we required ALT+REF>=8 and for GT=0/0 ALT/(REF+ALT)<0.05, for GT=0/1 0.1<ALT/(REF+ALT)<0.9, and for GT=1/1 ALT/(REF+ALT)>0.95. For 10x Genomics, we required ALT+REF>=5 in each haplotype and ALT/(REF+ALT)<0.05 or ALT/(REF+ALT)>0.95 in each haplotype, and the GT was determine from the combination of the 2 haplotypes. Variants not meeting these criteria were assigned a GT of "./." for that dataset. In addition, if no callset from an individual in the trio was in a cluster (e.g., if HG2count=0 for HG002), that individual was not assigned a GT of 0/1 or 1/1.
6. A consensus GT was derived by requiring that the GT's from all datasets were identical, excluding datasets with GT="./.", excluding Illumina and 10x GTs for sites in tandem repeats >100bp, excluding PacBio GTs for sites in tandem repeats >10kbp, and excluding all current datasets in segmental duplications >10kbp.
7. Sites were assigned PASS in the FILTER field if HG002 had a PacBio GT that was not homozygous reference and one of the following criteria were met: (a) A consensus GT of 0/1 or 1/1 was assigned for HG002, or (b) BioNano predicted an event <300bp different in size for HG002, or (c) the Nabsys svm_score was >0.9 for HG002. Sites were assigned the FILTER ClusteredCalls if they were within 1000bp of a different call >49bp in HG002 that met the above criteria, so were potentially complex, compound, or inaccurate. Sites were assigned the FILTER MaxEditDistgt04 if they met the above criteria but the maximum edit distance between any 2 calls in the cluster was >40%, so that the sequence change may not be accurate. Sites were assigned the FILTER lt50bp if they met the above criteria but were <50bp in size.
Support for each site, including information about discovery methods in the INFO fields and information about reads supporting REF and ALT alleles from each dataset and BioNano and Nabsys calls in the FORMAT fields. These fields are described in the vcf header.
New to v0.6, we have defined high-confidence regions to accompany the Tier 1 calls. These regions include regions covered by both haplotypes in at least one of 4 diploid assemblies, excluding repeats longer than the read length and segmental duplications >10kb. We then exclude regions (expanded to encompass tandem repeats) around any locations containing likely variants with unclear sequence or genotype, and any likely variants 20-49bp in size. The process used to define the confident regions is described in https://docs.google.com/presentation/d/1htn8yZ6c1jA6c4ZElUZGBAErv7qVkTEXXrjSPXutpC8/edit?usp=sharing.
New to v0.6, we have a Tier 2 callset defining regions in which there is likely to be at least one SV but it is complex or a consensus sequence change could not be determined using our current methods. We have expanded each region to encompass calls within 1000bp and encompass and tandem repeats overlapping the calls. This includes calls with FILTERs MaxEditDistgt04 or ClusteredCalls, as well as calls with the FILTER noConsensusGT if they are not within 1000bp of a Tier 1 variants. It also includes calls that were not tested by svviz in the v0.5.0 integration process, if there were calls multiple calls (even of different type or size) within 1kb from at least 2 technologies (including long reads) or at least 4 callsets.
This callset is expected to improve upon v0.5.0 in a few ways:
1. We have a high-confidence bed file, that is intended to enable users to assess FPs as well as FNs, and defines regions supported by diploid assemblies in which we have confidence there are not extra variants that do not match those in the Tier 1 vcf.
2. We have excluded clusters of differing calls from the Tier 1 vcf PASSing calls, because these calls tended to be imprecise or complex with more than one SV on one allele or on opposite alleles. Similarly we have excluded clusters with maximum edit distance >40% to exclude imprecise calls from the Tier 1 callset.
3. Some additional calls in long tandem repeats are now included since we now ignore inaccurate genotypes from short reads in tandem repeats >100bp or insertions in any tandem repeat.
4. We have added a new Tier 2 bed file that delineates regions where there is evidence for a variants >=50bp, but it is complex and/or we were unable to determine a consensus sequence change. This may be useful for benchmarking a tools ability to detect more challenging SVs, and we are interested in ideas or feedback from anyone who uses this new file.
5. No variants have SVTYPE=COMPLEX. All now are DEL or INS.
6. New annotations TRall, TRgt100, and TRgt10k have been added to
Known and Likely Limitations of this callset:
1. Although many of the Tier 1 calls are challenging (e.g., in long tandem repeats), it likely only includes 50% or less of the total SVs in the genome, and is likely biased towards easier SVs.
2. The predicted sequence change is not always accurate. If multiple methods predicted the same sequence change, we select it, but this is not the case for all sites, and biases can cause the same incorrect sequence change to be predicted.
3. The consensus genotype may be inaccurate in some cases, particularly if the predicted sequence change is inaccurate. The fraction of Mendelian errors for sites genotyped in all 3 members of the trio was ~2%, and more sites were heterozygous in all individuals than expected.
Dear GIAB Analysis Team,
I apologize that the v0.6 Tier 1 bed file defining high-confidence regions is incorrect. We will correct this soon, and will send an update when you can download it to do any evaluations.
Thank you for your patience!
Justin
--
You received this message because you are subscribed to the Google Groups "GIAB Analysis Team" group.
To unsubscribe from this group and stop receiving emails from it, send an email to
giab-analysis-t...@googlegroups.com.
To post to this group, send email to giab-anal...@googlegroups.com.
Visit this group at https://groups.google.com/group/giab-analysis-team.
To view this discussion on the web visit https://groups.google.com/d/msgid/giab-analysis-team/90DD0079-6221-44FA-A721-6DC89DDBFA64%40nist.gov.
For more options, visit https://groups.google.com/d/optout.