Summary of GIAB 2021 Technical Germline Benchmark Roadmap Meeting

Zook, Justin M. (Fed)

Nov 24, 2020, 4:03:22 PM
to genome-in...@googlegroups.com, GIAB Analysis Team

Dear GIAB Community,

 

Executive Summary: GIAB held a special call on Nov. 9, 2020 to discuss the technical roadmap for GIAB’s germline benchmarks in the next 1-2 years. Justin Zook presented an overview of GIAB’s progress towards benchmarks for challenging regions and variants, as well as GIAB’s collaborative work towards assembly-based benchmarks (https://www2.slideshare.net/GenomeInABottle/giab-technical-germline-benchmark-roadmap-discussion). Important points from the discussion included:

  1. GIAB has made important progress towards benchmarks for challenging regions in 2020.
  2. GIAB’s collaborations with the Human Pangenome Reference Consortium and Telomere to Telomere Consortium are a good opportunity to ensure GIAB’s benchmarks remain relevant as technologies and analysis methods continue to improve. New GIAB benchmarks developed collaboratively with HPRC/T2T should take advantage of diploid assemblies that have rapidly increased in quality, resolving challenging regions of the genome and structural variants excluded from current GIAB benchmarks, including those in many known medically relevant genes.
  3. Longer term, leveraging the development effort from HPRC/T2T in germline genome characterization may enable GIAB to expand its efforts towards additional priorities identified on the call, including benchmarks for somatic variants, improved usability of benchmarking tools, and benchmarking tools for more challenging variants. At the end of this email, we request your feedback about the relative impact of potential future GIAB deliverables on your work.

 

Justin Zook and Marc Salit introduced the goals of the call to discuss the technical roadmap for developing germline benchmarks in GIAB. To provide context, Justin Zook provided an overview of the consortium’s progress (https://www2.slideshare.net/GenomeInABottle/giab-technical-germline-benchmark-roadmap-discussion). In 2020 GIAB published the V0.6 SV benchmark set in Nature Biotechnology (https://rdcu.be/b4UMa) and the first assembly-based benchmark set (for the highly variable MHC region) in Nature Communications (https://www.nature.com/articles/s41467-020-18564-9). GIAB also released v4 small variant benchmarks for the Ashkenazi Trio, using long and linked reads to cover challenging genomic regions (https://doi.org/10.1101/2020.07.24.212712). Members of GIAB also collaborated with the Human Pangenome Reference Consortium (HPRC) and Telomere-to-Telomere Consortium (T2T) to use GIAB data and benchmarks to advance de novo assembly. The HPRC used HG002 as their pilot genome to evaluate assembly approaches, for which the GIAB team developed a novel assembly-based benchmarking pipeline using GIAB’s v4 small variant benchmark and GA4GH/GIAB stratifications (https://humanpangenome.org/hg002/).   This benchmarking demonstrated that the best assemblies can give highly accurate variant calls relative to the v4 benchmark, and are promising for expanding the GIAB benchmarks into the remaining regions that are difficult for traditional mapping-based approaches. The GIAB-team also worked with the precisionFDA team to run a follow-up to the 2016 truth challenge focused on evaluating variant calling methods in difficult-to-map regions and the MHC, capturing advances in sequencing, variant calling, and the benchmark sets in the four years since the initial challenge (https://precision.fda.gov/challenges/10/view/results, https://doi.org/10.1101/2020.11.13.380741). 
A number of new datasets were also released in 2020, including ultra-long ONT sequencing data, PacBio HiFi data, and strand-seq data.

 

Justin went on to outline the proposed roadmap for 2021. The roadmap emphasized development of new benchmark sets for GIAB’s genomes, including V4 benchmark sets for all seven genomes on GRCh37 and GRCh38, and using phased/diploid assembly-based methods going forward.  As an initial high-impact proof-of-principle for using whole genome diploid assembly, GIAB has a working group using the highest performing HG002 diploid assembly from the HPRC bakeoff (trio-hifiasm) to create phased small and structural variant benchmarks for medically relevant genes not covered well by v4. Karen Miga and Justin Zook have been discussing ongoing opportunities for collaboration with the HPRC and T2T to expand GIAB benchmarks to the X and Y chromosomes and some of the most challenging regions of the genome and variant types. Additionally, Justin described plans for acquiring new datasets, including additional strand-seq and ultra-long ONT data, as well as RNAseq and proteomics data. The NIST-GIAB team is receiving additional support to develop new AI methods for explainable characterizations with quantified uncertainty values for NIST’s genomic reference materials.

The proposed roadmap was discussed for the remainder of the call. The main topics were: (1) use of HG002 as the primary genome for methods development, and the relative priority of characterizing additional genomes; (2) the need for somatic benchmark sets and benchmarking methods; (3) prioritizing benchmarking method development; and (4) the outlook for 2024, 2025, and beyond.

For the last few years, the GIAB-team has used HG002 as its primary genome for methods development. While there is an overhead for characterizing additional genomes, consortium members on the call generally agreed that there is value in continuing to characterize GIAB genomes beyond HG002. Noushin Ghaffari said one advantage of generating benchmark sets for all genomes is that it can improve our understanding of the methods used to generate the benchmark sets. Having multiple benchmark samples also helps minimize over-tuning methods to one sample, and many laboratories still use HG001 as their primary sample.  However, characterizing all samples comes at the cost of creating benchmarks for more challenging genes, regions, and variants in HG002 that users are currently blind to. We are interested in further input about the relative value of expanding the benchmark for HG002 vs characterizing more or all GIAB samples for particular use cases.

 

While not the focus of this call, a number of people on the call identified the ongoing need for somatic benchmark sets and benchmarking methods. The GIAB-team is working to characterize mosaic variants in HG002 so it can be used as a negative control for somatic variant detection, and as an initial step toward the development of a somatic benchmark set. Additionally, because current tumor-normal cell line pairs were either not consented or lack appropriate consent for genomic reference material development, the NIST team is continuing to work on acquiring material for tumor-normal reference materials. Even without these samples, Steve Lincoln proposed a “Somatic Benchmarking Kit” that would define best practices for using mixtures of GIAB DNA to benchmark somatic variant calling with GIAB’s characterization of mosaic variants. Andrew Carroll also brought up the value of getting a somatic benchmark set out sooner rather than later, to establish GIAB as a leader in the field.
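
As a rough illustration of the arithmetic behind such mixtures (not a GIAB protocol), the sketch below estimates the expected alternate-allele fraction at a site when DNA from one sample is spiked into another. The function name `expected_vaf` and its parameters are hypothetical, and the sketch assumes diploid sites, equal DNA contribution per genome copy, and no copy-number differences between the samples.

```python
def expected_vaf(spike_fraction, spike_alt_copies, background_alt_copies=0, ploidy=2):
    """Expected alternate-allele fraction at one site in a two-sample DNA mixture.

    spike_fraction: fraction of the mixture's DNA from the spiked-in sample (0 to 1).
    spike_alt_copies: alternate-allele copies in the spiked-in sample (0..ploidy).
    background_alt_copies: alternate-allele copies in the background sample.
    Assumption (illustrative only): both samples have the same ploidy at the site.
    """
    if not 0.0 <= spike_fraction <= 1.0:
        raise ValueError("spike_fraction must be between 0 and 1")
    for copies in (spike_alt_copies, background_alt_copies):
        if not 0 <= copies <= ploidy:
            raise ValueError("allele copies must be between 0 and ploidy")
    # Weighted average of alternate-allele copies, divided by total copies per genome.
    alt_copies = (spike_fraction * spike_alt_copies
                  + (1.0 - spike_fraction) * background_alt_copies)
    return alt_copies / ploidy


if __name__ == "__main__":
    # Heterozygous variant private to the spiked-in sample, mixed at 10%.
    print(expected_vaf(0.10, spike_alt_copies=1))
```

Under these assumptions, a heterozygous variant found only in the spiked-in sample and mixed at 10% would be expected at an allele fraction of about 5%, which is why mosaic variants already present in the background sample need to be characterized before low-fraction calls can be interpreted.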

Much of the roadmap discussion revolved around the need for benchmarking method development. Steve Lincoln identified two main challenges associated with benchmarking: 1) current tool usability, and 2) the inability of current benchmarking methods to handle difficult variants. Current benchmarking methods require a level of bioinformatics expertise. More “turn-key” benchmarking methods would increase their adoption in both the research and clinical diagnostic communities, making it easier for diagnostic assay developers to do validation studies. Since some tools already exist that enable users to simply upload a VCF and receive performance metrics in a web interface, such as on precision.fda.gov, Fritz Sedlazeck suggested that outreach work such as benchmarking tutorial videos may help. Regarding development of benchmarking methods capable of handling complex variants, Justin Zook believes these new methods are likely to come from collaborations with HPRC as they develop pangenomic representations of these variants. GIAB has a diverse set of stakeholders from biotech, academia/research, and clinical diagnostic labs. While GIAB has not developed software tools as products thus far, Marc Salit said a more formal product development approach could guide tool and resource development to better address our diverse stakeholder community’s needs.
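
For readers less familiar with the performance metrics mentioned above, the following is a minimal sketch of how precision, recall, and F1 can be derived from true-positive, false-positive, and false-negative counts. The function name `summarize_counts` is hypothetical and is not part of any GIAB or precisionFDA tool. It also assumes a single true-positive count, whereas real comparison tools may count matches separately for the benchmark and the query callset.

```python
def summarize_counts(tp, fp, fn):
    """Return precision, recall, and F1 for a set of variant comparison counts.

    tp: calls that match the benchmark
    fp: calls inside benchmark regions that do not match the benchmark
    fn: benchmark variants that were not called
    A metric is reported as None when its denominator is zero.
    """
    if min(tp, fp, fn) < 0:
        raise ValueError("counts must be non-negative")
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    if precision is None or recall is None or (precision + recall) == 0:
        f1 = None
    else:
        # Harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    print(summarize_counts(tp=90, fp=10, fn=30))
```

Computing these values separately for each stratification (for example, difficult-to-map regions versus the rest of the genome) is one way a report could show where a method performs poorly.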

 

Fritz Sedlazeck spearheaded a forward-looking discussion by asking what a roadmap for 2024, 2025, and beyond would look like. Additional characterizations such as RNA, somatic, and single-cell were discussed. Marc Salit talked about an idea for using differentiated iPSC cells as the basis for a suite of “omics” reference materials. Justin Zook noted that the presented roadmap was developed partly to take advantage of GIAB’s collaborations with the HPRC and T2T, large NHGRI-funded consortia leading development of new assembly and pangenome methods that have already shown promise for improving GIAB’s benchmarks.

Request for feedback:

We greatly appreciate your ongoing feedback about GIAB’s roadmap and the relative priority of the deliverables listed below, as well as any other deliverables that might impact your work. We’d be interested in your ranking of some or all of these by priority, with an explanation of why they are important for your or others’ use cases:

  1. Improved structural variant benchmarks that include more complex types like inversions and challenging regions like those in seg dups.
  2. Improved tandem repeat benchmarks for GIAB samples
  3. Benchmarks for copy number variable segmental duplications (e.g., KIR genes)
  4. Benchmarks for chrX and chrY
  5. Benchmarks for HG005 and HG001 in addition to HG002 (or benchmarks for all 7 GIAB genomes)
  6. GIAB benchmarks for new samples from other ancestries
  7. Best practices for using mixtures of GIAB samples to benchmark somatic variant calling
  8. Improved usability of existing benchmarking tools (e.g., reports that visualize stratifications of interest, reports that compare performance to other tools, additional web-based interfaces for tools, reports relevant for clinical validation)
  9. New benchmarking tools for variant types that can’t currently be compared robustly (e.g., complex SVs, tandem repeats, copy number increases, satellite DNA)
  10. Improved benchmarking tools for de novo assemblies
  11. Improved GIAB data curation and development of an interface for searching for datasets by sample, technology, etc.

 

Thank you!

Justin and Marc

 
