Volunteers needed with expertise in small variant calling to evaluate v4.1 benchmark

577 views

Skip to first unread message

Zook, Justin M. (Fed)

unread,

Dec 19, 2019, 3:33:32 PM12/19/19

to GIAB Analysis Team, Wagner, Justin M. (Fed)

Dear GIAB Analysis Team,

Summary: If you have expertise in small variant calling, we’d like your help evaluating our v4.1 benchmark set by January 10 - specifically, as soon as possible send Justin Wagner and Justin Zook your GRCh37 and GRCh38 vcfs for HG002 and we will send you a selection of differences between your vcf and the benchmark to manually curate, which should also help you understand your performance in more difficult regions of the genome.

We are recruiting volunteers to evaluate our v4.1 small variant benchmark set for the GIAB Ashkenazi son HG002 on GRCh37 and GRCh38. An important part of our process to evaluate new benchmarks is to make them available to users who can test the utility and accuracy of the benchmarks. Our primary objective for these benchmarks is that they reliably identify both false negatives (FNs) and false positives (FPs) in any callset being compared to the benchmark (the “query callset”). Specifically, we want to demonstrate that, when callsets from a diverse set of technologies and variant calling methods are compared to our benchmark, the majority (ideally all) of the putative false positives and false negatives for each callset are in fact errors in the query callset.

Similar to our process for the v0.6 structural variant benchmark (https://doi.org/10.1101/664623), we are now asking for volunteers with expertise in a particular small variant calling method to compare your callset to the benchmark for HG002 on GRCh37 and GRCh38. In addition to looking at statistics like precision and recall to ensure they are reasonable, we most importantly ask that you manually curate some of the putative false positives and false negatives to ensure they are in fact errors in your callset and not errors in the benchmark. If you would like to participate, we request that you follow this process (Note that we have simplified this process from the v4.0 draft evaluation from September 2019):

First, send Justin Wagner and Justin Zook your vcf (you can upload to https://drive.google.com/drive/folders/1BrUPsQM6r6Js3HexZYo2VUbv2MQWm92D?usp=sharing).
We will compare your vcf to the v4.1 vcf in the v4.1 bed using hap.py with the vcfeval option (https://github.com/Illumina/hap.py)

Commands that we will use for comparison and stratification are below in this email.

We will randomly select 5 FP SNPs, 5 FN SNPs, 5 FP indels and 5 FN indels, each from inside and outside the v3.3.2 benchmark bed, in GRCh37 and GRCh38 (5*4*2*2=80 total) that are outside the MHC. We will also select 5 FP SNPs, 5 FN SNPs, 5 FP indels and 5 FN indels inside the MHC for GRCh37 only.
Manually curate each of these 80+20 variants using IGV or your preferred genome browser with at least the datasets below. For your convenience, we have created xml sessions that can be loaded into IGV that work with these files remotely, so you only need to load your vcf and any additional files you find useful:

IGV sessions that load the files below, except for yours (load with “Load from URL”)

GRCh37: https://dl.dnanex.us/F/D/xJgf1F5k43Z1jG70qbgbZxPVP1k851BFFzQ3G2X7/igv_session_GRCh37_v4.1_evaluation_GIAB_curation_HG002.xml
GRCh38: https://dl.dnanex.us/F/D/3KjZ0J26Xbqk04bKGVFF1pF748VVpqkpy1Gx2xFx/igv_session_GRCh38_v4.1_evaluation_GIAB_curation_HG002.xml
Note: You may need to adjust IGV settings for these sessions to render optimally. We recommend the following steps: (1) Check that all 3 tracks: “Gaps + Odd Regions”, “Segdups”, “Repeats” (under the ONT track) are visible on screen, and adjust track heights as desired. (2) If needed, uncheck View->Preferences->Alignments->“Hide indels < show indel threshold” and View->Preferences->Third Gen-> “Hide indels < show indel threshold”, as this can make it appear that reads are missing small indels during curation (an example where almost all reads should have a 1bp deletion is chr1:5021621 on GRCh38).

PCR-free Illumina (particularly useful for homopolymers and tandem repeats)

ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/NIST_HiSeq_HG002_Homogeneity-10953946/NHGRI_Illumina300X_AJtrio_novoalign_bams/

PacBio CCS (particularly useful for long tandem repeats, LINEs, and segmental duplications)

ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/PacBio_CCS_15kb_20kb_chemistry2/

10x Genomics (particularly useful for segmental duplications and phased reads)

ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/10XGenomics_ChromiumGenome_LongRanger2.2_Supernova2.0.1_04122018/

ONT (particularly useful for large segmental duplications)

ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/Ultralong_OxfordNanopore/final/

UCSC repeats, segdups, and gap bed files

https://drive.google.com/open?id=166gUCDg_8ICJaSmjLFj6LqiiGXqALgpa

v4.1 draft benchmark vcf and bed
Your vcf
Using resources for additional evidence in these samples is appropriate and, if used, we encourage you to communicate the resource and how you used it.

To get the most reliable results, we ask you to be critical of the benchmark, and select unsure if the evidence does not strongly support the benchmark being correct. NIST will re-curate any variants that you record as unsure or as an error in the benchmark. We will send each participant a link to a google doc with the sites to manually curate
If you wish to further evaluate your calls, you are welcomed to further stratify and curate your FPs and FNs by difficult regions like those available at:
https://github.com/ga4gh/benchmarking-tools/tree/master/resources/stratification-bed-files
We also welcome other types of evaluations of the v4.1 benchmark beyond those described above.

The draft benchmark vcf and bed files are at:

ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_v4.1_SmallVariantDraftBenchmark_12182019/

If possible, we would like your curation results by the end of January 10, 2020,. If you think you need more time, please let us know.

Please email Justin Wagner and Justin Zook a link to your vcfs (you can upload to https://drive.google.com/drive/folders/1BrUPsQM6r6Js3HexZYo2VUbv2MQWm92D?usp=sharing) as soon as possible if you are willing to do the above curations, along with information about, so that we can recruit volunteers for any methods not represented.

Thank you, and let us know if you have any questions!

Justin Wagner and Justin Zook

For your reference, the commands we will use to compare your vcf to the benchmark and select sites for manual curation are below, which output 4 tsv files that have CHROM,POS,REF,ALT,FORMAT,TRUTH GT, QUERY GT

Python script is at https://dl.dnanex.us/F/D/x1fYZ26Fxf1g0fXYxQYP9q6vGPygGbBgKQP4Kz9Y/manual_curation_selection.py

The v3.3.2 bed files are at ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/NISTv3.3.2/GRCh37/HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_noinconsistent.bed ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/NISTv3.3.2/GRCh38/HG002_GRCh38_GIAB_highconf_CG-Illfb-IllsentieonHC-Ion-10XsentieonHC-SOLIDgatkHC_CHROM1-22_v.3.3.2_highconf_noinconsistent.bed

#GRCh37

./hap.py HG002_GRCh37_1_22_v4.1_draft_benchmark.vcf.gz your.vcf.gz -f HG002_GRCh37_1_22_v4.1_draft_benchmark.bed -o truth_v4_GRCh37_query_yourvcf -r genome.fa --engine=vcfeval --engine-vcfeval-template hg19

bedtools intersect -a truth_v4_GRCh37_query_sequelii_DV_vcf_eval.vcf -b HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_noinconsistent.bed -header > v4_GRCh37_query_sequelii_DV_inside_v.3.3.2_highconf.vcf

bedtools subtract -a truth_v4_GRCh37_query_sequelii_DV_vcf_eval.vcf -b HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_noinconsistent.bed -header > v4_GRCh37_query_sequelii_DV_outside_v.3.3.2_highconf.vcf

python manual_curation_selection.py --input_vcf_file v4_GRCh37_query_sequelii_DV_inside_v.3.3.2_highconf.vcf --output_prefix v4_GRCh37_query_sequelii_DV_inside_v.3.3.2_highconf

python manual_curation_selection.py --input_vcf_file v4_GRCh37_query_sequelii_DV_outside_v.3.3.2_highconf.vcf --output_prefix v4_GRCh37_query_sequelii_DV_outside_v.3.3.2_highconf

#GRCh38

./hap.py HG002_GRCh38_1_22_v4.1_draft_benchmark.vcf.gz your.vcf.gz -f HG002_GRCh38_1_22_v4.1_draft_benchmark.bed -o truth_v4_GRCh38_query_yourvcf -r genome.fa --engine=vcfeval --engine-vcfeval-template hg38

bedtools intersect -a truth_v4_GRCh38_query_sequelii_DV_vcf_eval.vcf -b HG002_GRCh38_GIAB_highconf_CG-Illfb-IllsentieonHC-Ion-10XsentieonHC-SOLIDgatkHC_CHROM1-22_v.3.3.2_highconf_noinconsistent.bed -header > v4_GRCh38_query_sequelii_DV_inside_v.3.3.2_highconf.vcf

bedtools subtract -a truth_v4_GRCh38_query_sequelii_DV_vcf_eval.vcf -b HG002_GRCh38_GIAB_highconf_CG-Illfb-IllsentieonHC-Ion-10XsentieonHC-SOLIDgatkHC_CHROM1-22_v.3.3.2_highconf_noinconsistent.bed -header > v4_GRCh38_query_sequelii_DV_outside_v.3.3.2_highconf.vcf

python manual_curation_selection.py --input_vcf_file v4_GRCh38_query_sequelii_DV_inside_v.3.3.2_highconf.vcf --output_prefix v4_GRCh38_query_sequelii_DV_inside_v.3.3.2_highconf

python manual_curation_selection.py --input_vcf_file v4_GRCh38_query_sequelii_DV_outside_v.3.3.2_highconf.vcf --output_prefix v4_GRCh38_query_sequelii_DV_outside_v.3.3.2_highconf

Improvements in v4.1 vs v4.0:

Use new 15kb+20kb CCS data with longer reads and higher coverage
Exclude inversions identified in assemblies of HG002
Better exclusion of CNVs
Exclude V(D)J region that undergoes somatic recombination
Refinement of MHC benchmark in homopolymers and tandem repeats

Improvements in v4.0 vs. v3.3.2:

Use PacBio CCS calls from DeepVariant and GATK. For DeepVariant, exclude RefCalls with GQ<40, but allow calls in homopolymers <11bp
Use 10x Genomics LongRanger variant calls instead of custom, conservative 10x calls
Used targeted diploid assembly of PacBio CCS to create benchmark variants and regions for MHC (6:28477800-31946000 and 6:32017001-33448350 in GRCh37, excluding a misassembly in the region containing CYP21A2, CYP21A1P, TNXA, and TNXB)
For GRCh37, we removed regions that were expanded or compressed in GRCh38 relative to GRCh37 are excluded from our GRCh37 benchmark regions (GRC and minimap2 alignments)
We used the new v0.6 structural variant calls to exclude putative structural variant regions, expanded by 50% on each side
We exclude regions with higher than expected (1.5 times the median) coverage in both PacBio and Illumina or >2 times the median coverage in PacBio-only to exclude potential duplications in HG002
We exclude regions called as duplications by MrCaNaVaR from Illumina WGS, which include segmental duplications in the reference, from short read callable regions
We exclude 15kb on either side of N’s in the reference to minimize errors around gaps in GRCh37 or GRCh38
We exclude regions covered by more than one contig for either haplotype of a diploid assembly generated from PacBio and from ONT
We exclude regions covered by segmental duplications greater than >10kb and more than 5 copies with >99% similarity to each other
We exclude L1H LINEs >500bp in length from short read callable regions to minimize clusters of variants missed by short reads
We exclude difficult to map regions from short reads
We now use additional tandem repeat and low complexity annotations from RepeatMasker and TRF because our previous bed files had missed excluding some long tandem repeats from short and linked read methods
Remove small benchmark regions <50bp in size
Now use a segmental duplications bed file for GRCh38 instead of the self-chain, since the GRCh38 self-chain had some problems

Reply all

Reply to author

Forward

0 new messages