Dear GIAB Analysis Team,
Justin Zook presented context for the proposed RNA sequencing of GIAB samples. The scope is to generate an initial public RNA sequencing data resource, not to start a large new effort to develop RNA sequencing benchmarks at this time. The limited funding for RNA sequencing is part of an internal NIST collaboration to develop machine learning methods for genomics and proteomics, and proteomics data will also be generated for GIAB samples. Over the years a number of groups have reached out to GIAB with interest in RNAseq data and benchmark sets. Because the GIAB samples are well characterized and fully consented, extensive epigenetic data were generated for them to compare methylation and ATACseq sequencing methods in the recent EpiQC project led by Chris Mason and Jonathan Foox (https://www.biorxiv.org/content/10.1101/2020.12.14.421529v1). They will present this work on the GIAB call scheduled for April 19. Marc Salit asked about epigenetic variability between cell line growths, which is also a potentially important source of variability for RNA sequencing. This was beyond the scope of the initial EpiQC study, and is likely beyond the scope of this initial GIAB RNA sequencing project as well. Annotation and isoform analysis use cases may be less sensitive to these sources of variation.
A number of people identified potential use cases for the new RNAseq dataset, including training AI to predict the impact of splicing variants on isoforms, building a qualitative isoform benchmark set, and gene annotation. Liz Tseng from PacBio was particularly excited about the prospects of an RNA benchmark set, stating “I drool every time someone shows me a benchmarked result against GIAB genome samples. …like, if GIAB releases a ‘truth RNA’ dataset, i'd like to be able to say, everything this ‘truth’ set contains, it must really be there, and it's ok if low/rare stuff is missed”. The group reached consensus around Liz’s idea of generating a qualitative transcript isoform set.
For sample diversity, LCLs are available for the two GIAB trios. Coriell also has iPSCs derived from the LCLs for HG002 (Ashkenazi son), HG004 (Ashkenazi mother), and HG005 (Han Chinese son), as well as an iPSC derived directly from PBMCs for HG002.
The proposed plan is to initially sequence RNA from the 3 LCLs and 4 iPSCs for HG002, HG004, and HG005 on three platforms: ONT, PacBio, and Illumina. The resulting dataset would be made public, and GIAB would seek collaborators to help with analyses such as gene annotation, isoform analysis, and developing a possible qualitative transcriptome benchmark. GIAB is also interested in assistance from anyone able to contribute RNA sequencing to this effort.
Future work, likely outside the scope of the initial project, could include investigating RNAseq of differentiated iPSCs, differential expression, and RNA reference samples. NIST has a postdoctoral opportunity (https://tinyurl.com/ybx5mxwx) that could take advantage of these unique data and pursue these types of future applications.
Please reach out to the NIST team if you are interested in collaborating on the sequencing and/or analyses of these data, as this will help us plan how to maximize the impact of this work.
Thanks,
NIST-GIAB Team
The next GIAB Machine Learning call is planned for 3pm EDT (12pm PDT) on March 29, and Kishwar Shafin from UCSC will present his work developing PEPPER-DeepVariant and his work towards diploid assembly polishing.