GIAB Machine Learning Variant Caller Call Summaries


Zook, Justin M. (Fed)

Jun 11, 2021, 2:02:22 PM
to GIAB Analysis Team

Dear GIAB Analysis Team,

 

Thanks very much to Andrew Carroll, Kishwar Shafin, Mohammad Sahraeian, and Maria Nattestad for their presentations on the GIAB Machine Learning calls over the past few months, as we explore possibilities for using these methods to improve characterization of GIAB samples, including assigning uncertainties and explaining the variant calls. Brief summaries of each of these calls are included below, along with the NIST GIAB Team's February presentation about uncertainty calibration. Slides for some of the presentations are available at https://drive.google.com/drive/folders/1oyGSWVNAPPvpAUFVbfLEcQOTK8TWSUeI?usp=sharing. We will not be having a GIAB call next Monday, June 14.

 

Best regards,

Justin Zook and Justin Wagner

 

February 1, 2021

 

The NIST GIAB team presented Calibration of Uncertainty Estimates in Genomic Variant Calling (slides attached), examining the calibration of the QUAL score for deep learning-based precisionFDA Truth Challenge V2 submissions. We reviewed definitions of variant confidence scores in VCF (e.g., QUAL and GQ), as well as how QUAL is calculated in the participating deep learning variant callers. An empirical QUAL score was then generated for each submission from the false positive rate reported by hap.py in each QUAL score bin. The initial analysis compared the QUAL scores assigned by the variant callers against these empirical QUAL scores. An interesting result was that QUAL scores were higher than expected, given performance against the GIAB benchmark, in low-mappability regions and segmental duplications. Audience questions and discussions included approaches to continue variant confidence score analysis, future work analyzing genotype quality (GQ) calibration, how QUAL scores could be influenced by biases in sequencing platform or variant caller method, and methods to consider when expanding the GIAB benchmarks to express benchmark variant uncertainty.
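
For concreteness, here is a minimal Python sketch of the binned calibration idea; the input format (per-call reported QUAL plus a hap.py-derived false positive label) and all names are illustrative rather than the actual analysis code.

    import math
    from collections import defaultdict

    def empirical_qual_by_bin(calls, bin_width=10):
        # calls: iterable of (reported_qual, is_false_positive) pairs,
        # e.g. derived from hap.py's annotated query VCF (assumed format)
        counts = defaultdict(lambda: [0, 0])  # bin start -> [FPs, total]
        for qual, is_fp in calls:
            b = int(qual // bin_width) * bin_width
            counts[b][0] += int(is_fp)
            counts[b][1] += 1
        table = {}
        for b, (fp, total) in sorted(counts.items()):
            fp_rate = fp / total
            # Phred-scale the observed false positive rate; a bin with no
            # observed FPs has an unbounded empirical QUAL
            emp_qual = -10 * math.log10(fp_rate) if fp else float("inf")
            table[b] = (fp_rate, emp_qual)
        return table

A well-calibrated caller would report QUAL values close to these empirical ones in every bin; as noted above, reported values exceeded the empirical ones in the difficult regions.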

 

March 1, 2021

 

Andrew Carroll presented ideas for using DeepVariant to help evaluate benchmarks that are slated for development (v4.2.1 for HG005, HG006, HG007, and HG001). He detailed three proposals: (1) train a model on HG002 and then evaluate performance on HG005, and vice versa; (2) compare models' estimates of genotype quality when trained on different samples or on many samples together; (3) check different models' and technologies' GQ estimates across the GA4GH/GIAB genome stratifications (a rough sketch of such a comparison follows below). Questions and discussions included how HG005/6/7/1 could best be used for training, whether genome stratifications could be useful during training, and training on multiple technologies similar to one of the pFDA Truth Challenge V2 submissions.
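
As a rough illustration of proposals (2) and (3), the sketch below collects GQ values from a callset so their distributions can be compared across models or stratifications; it assumes pysam is available, and the file and sample names are placeholders.

    import statistics
    import pysam  # assumed dependency for VCF parsing

    def gq_values(vcf_path, sample=None):
        # Collect per-call genotype quality (GQ) from one callset, e.g. to
        # compare HG002-trained vs HG005-trained models on the same reads
        gqs = []
        with pysam.VariantFile(vcf_path) as vcf:
            name = sample or next(iter(vcf.header.samples))
            for rec in vcf:
                try:
                    gq = rec.samples[name]["GQ"]
                except KeyError:
                    gq = None
                if gq is not None:
                    gqs.append(gq)
        return gqs

    # Placeholder paths: HG005 calls from two differently trained models
    for path in ("hg005.by_hg002_model.vcf.gz", "hg005.by_hg005_model.vcf.gz"):
        vals = gq_values(path)
        print(path, len(vals), statistics.median(vals))

Restricting the same tally to calls inside a GA4GH/GIAB stratification BED would give the per-region comparison in proposal (3).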

 

March 29, 2021

 

Kishwar Shafin from UCSC presented on using PEPPER-Margin-DeepVariant for long-read variant calling. A bioRxiv preprint describing PEPPER-Margin-DeepVariant is available at https://www.biorxiv.org/content/10.1101/2021.03.04.433952v1.full.pdf. Kishwar detailed the variant calling pipeline, including alignment, SNP-based haplotyping to generate a phased BAM, haplotype-aware variant calling, and generation of a phased VCF as output. This pipeline was created because the standard DeepVariant pipeline identifies too many false positive candidates on ONT data, due to noise in the reads. PEPPER performs candidate finding, DeepVariant is used for variant calling, and Margin produces a phased VCF. Kishwar also detailed a new application of polishing Shasta assemblies of ONT data with PEPPER-Margin-DeepVariant. Audience questions and discussion included the types of information used for phasing, use of reference assemblies as compared to diploid assemblies, and how different sequencing data types could be used to improve the quality of variant calling for GIAB reference samples.
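
For orientation, here is a hedged sketch of running the pipeline end to end through its published Docker image; the image tag, file paths, thread count, and flags follow the project's README from around that time and should be treated as assumptions to check against the current documentation.

    import subprocess

    # Illustrative end-to-end invocation on an aligned ONT BAM; all paths
    # and the image tag are placeholders (assumptions)
    cmd = [
        "docker", "run", "-v", "/data:/data",
        "kishwars/pepper_deepvariant:r0.4",    # assumed image tag
        "run_pepper_margin_deepvariant", "call_variant",
        "-b", "/data/HG002.ont.GRCh38.bam",    # aligned ONT reads
        "-f", "/data/GRCh38.fa",               # reference FASTA
        "-o", "/data/pmdv_output",             # output directory
        "-t", "16",                            # threads
        "--ont",                               # select the ONT model
        "--phased_output",                     # also emit phased VCF/BAM
    ]
    subprocess.run(cmd, check=True)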

 

April 26, 2021

 

Mohammad Sahraeian from Roche presented on NeuSomatic (https://github.com/bioinform/neusomatic), a deep learning-based variant caller that was originally developed for somatic variant calling. NeuSomatic was modified to call germline variants for the pFDA Truth Challenge V2. NeuSomatic takes in read alignments and can generate an augmented reference by adding gaps to handle insertions. The read alignment data is then processed to generate matrices that summarize support for the reference and variants, which are then used as input to a network. The architecture contains 4 ResNet-like blocks along with 9 convolutional layers that classify mutation type and length and regress the variant position. Germline variant calling requires post-processing to handle multi-allelic variants. Audience questions and discussions included robustness to differences in read lengths and how variant quality scores are calculated.
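
A toy NumPy sketch of the matrix construction described above; real NeuSomatic matrices carry additional channels (coverage, allele frequency, etc.), so the reference-plus-read-support form and the '-' gap convention here are simplifications for illustration only.

    import numpy as np

    BASES = "ACGT-"  # '-' marks gap columns added for insertions

    def support_matrix(ref_window, read_windows):
        # ref_window: augmented reference with gaps inserted where reads
        # carry insertions; read_windows: reads realigned to that window
        width = len(ref_window)
        read_support = np.zeros((len(BASES), width), dtype=int)
        for read in read_windows:
            for col, base in enumerate(read):
                if base in BASES:
                    read_support[BASES.index(base), col] += 1
        ref_support = np.array(
            [[int(ref_window[col] == base) for col in range(width)]
             for base in BASES])
        # Stack reference and read support as two input channels
        return np.stack([ref_support, read_support])

    # A 'T' insertion after position 3 adds a gap column to the reference
    # so reads with and without the insertion stay column-aligned
    m = support_matrix("ACG-T", ["ACGTT", "ACG-T"])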

 

May 24, 2021

 

Maria Nattestad presented Google's DeepVariant team's work on understanding and explaining variant calling performance, some of which is included in a new blog post from the Google team about the history of DeepVariant: https://google.github.io/deepvariant/posts/2021-06-08-deepvariant-over-the-years/. Maria described DeepVariant as a sequencing data visualization + image classification program that uses Google's Nucleus library to process read data. DeepVariant identifies candidate variants using the "Very Sensitive Caller", then classifies each candidate into one of three classes (homozygous reference, heterozygous variant, homozygous variant). DeepVariant generates a specialized pileup image centered on the putative variant, 221 bases wide, showing 95 read rows + 5 reference rows, with 6 channels encoding alignment and variant support data. DeepVariant has been trained on a variety of sequencing data, including Illumina WGS from HiSeq and NovaSeq (both PCR+ and PCR-free), Illumina WES, PacBio HiFi, and multiple technologies combined. The DeepVariant team has explored several approaches to explaining classifications, including saliency maps for feature importance in the input, t-SNE visualizations of examples at different layers of the network, and knocking out channels of the input tensor to determine their importance. Audience questions and discussions included INDEL realignment, how insertions are represented, and the importance of the particular NN architecture that is used.
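
To make the pileup dimensions concrete, here is a small NumPy sketch; the tensor shape and three classes come from the talk, while the channel names are taken from DeepVariant's public documentation, and the helper itself is purely illustrative (the real tensor is built by DeepVariant's make_examples stage via Nucleus).

    import numpy as np

    HEIGHT, WIDTH, CHANNELS = 100, 221, 6    # 95 read rows + 5 reference rows
    CLASSES = ("hom_ref", "het", "hom_alt")  # the three genotype classes
    # Channels per DeepVariant's documentation (not restated in the talk):
    # read base, base quality, mapping quality, strand,
    # read-supports-variant, base-differs-from-ref

    def empty_pileup():
        # Allocate a pileup tensor of the shape the classifier consumes;
        # the candidate variant sits in the center column
        return np.zeros((HEIGHT, WIDTH, CHANNELS), dtype=np.uint8)

    image = empty_pileup()
    center_col = WIDTH // 2  # column 110 holds the putative variant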
