Second Ultralong ONT data release for HG002

Zook, Justin (Fed)

Aug 23, 2018, 2:51:23 PM

to giab-anal...@googlegroups.com

Dear GIAB Analysis Team,

Our second public release of “ultralong” reads from Oxford Nanopore for the GIAB Ashkenazi son (HG002) is now available. We now have ~16x coverage with a mapped N50 of ~50kb, >4x coverage by reads >100kb, and a few reads >1Mb. The README, fastq, and cram files (including a cram file haplotype-partitioned with whatshap), as well as a pdf plotting coverage vs. read length, are available here:

ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/Ultralong_OxfordNanopore/combined_2018-08-10/

Detailed information from README:

"Ultralong" Sequencing Data Release

2018-08-10 - 16x genome coverage of HG002

This is the second release. It adds 9X coverage to the previous release, for a total of 16X coverage, including 4X coverage by reads > 100kb.

Oxford Nanopore sequencing of "ultralong" (Jain et al. 2018) libraries was performed on Genome in a Bottle cell line HG002 (GM24385). Additional sequencing is underway, so this dataset will continue to grow. We project approximately 40X total coverage, including 10X coverage by reads > 100kb, to be available by the end of 2018.

DNA was prepared using various modified versions of Josh Quick's protocol (https://dx.doi.org/10.17504/protocols.io.mrxc57n). DNA fragment length varies substantially from prep to prep, but overall statistics are provided in the accompanying .qc.txt files.
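For readers who want to recompute rough length statistics from the fastq themselves, here is a minimal sketch. It is not the script used to produce the .qc.txt files. The helper names read_lengths and length_stats are hypothetical, the fastq is assumed to be uncompressed with 4-line records, and the genome size is supplied by the caller (about 3.1 Gb is a common approximation for human). Note that it reports a raw read-length N50 and bases divided by genome size, which will differ somewhat from the mapped N50 and aligned coverage quoted above.

```shell
# Sketch only: read_lengths and length_stats are hypothetical helper names.

# Print one read length per line from an uncompressed 4-line-record FASTQ on stdin.
read_lengths() {
    awk 'NR % 4 == 2 { print length($0) }'
}

# Usage: length_stats GENOME_SIZE MIN_LEN   (read lengths on stdin, one per line)
# Prints read count, total bases, read-length N50, total bases / genome size,
# and the same ratio restricted to reads longer than MIN_LEN.
length_stats() {
    sort -rn | awk -v g="$1" -v min="$2" '
        { len[NR] = $1; total += $1; if ($1 > min) longb += $1 }
        END {
            half = total / 2
            # Lengths arrive longest first; N50 is the length at which the
            # running sum first reaches half of all bases.
            for (i = 1; i <= NR; i++) {
                run += len[i]
                if (run >= half) { n50 = len[i]; break }
            }
            printf "reads=%.0f bases=%.0f N50=%.0f coverage=%.2f coverage_gt_min=%.2f\n",
                   NR, total, n50, total / g, longb / g
        }'
}
```

A possible invocation, with a placeholder file name: read_lengths < reads.fastq | length_stats 3100000000 100000. For a gzipped fastq, pipe it through gzip -dc first.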

Sequencing was primarily performed using a mix of SQK-RAD003 and SQK-RAD004 library prep kits on FLO-MIN106 flow cells. Additional sequencing was performed using the MinION, GridION, and PromethION. Raw sequence data (fast5) are included in the release as 10 tar files. Reads were base-called with albacore (version 2.3.1), then combined into a single fastq file.

The combined fastq file was separately aligned to the hs37d5 and hg38 human genome reference assemblies using Heng Li's minimap2 (https://github.com/lh3/minimap2) with the following options:

-a -z 600,200 -x map-ont

These settings are subject to change in subsequent releases. In particular, the "Z-drop" score (the first value passed to -z) was increased from the default of 400 to 600 in order to increase the contiguity of ultralong read alignments; the second value, 200, is the minimap2 default for the inversion Z-drop. In our initial analyses, this produced more contiguous alignments, but please let us know if you think this setting is producing alignments that are more contiguous than they should be!

The resulting aligned reads were sorted and compressed in cram format (similar to bam format). We use cram partly for historical reasons, as bam format only recently adopted conventions allowing for the long CIGAR strings found in our ultralong read sequencing.
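The alignment and compression steps described above can be sketched roughly as follows. This is an illustration, not the exact pipeline used for the release: the function name align_ultralong, the file names, and the thread count are assumptions, and only the minimap2 options -a -z 600,200 -x map-ont come from the README. It assumes minimap2 and samtools are on the PATH.

```shell
# Sketch only: align_ultralong is a hypothetical helper, not the GIAB pipeline script.
# Usage: align_ultralong REF.fa READS.fastq OUT.cram [THREADS]
align_ultralong() {
    ref_fa=$1
    reads_fq=$2
    out_cram=$3
    threads=${4:-4}   # thread count is an assumption; the README does not state one

    # -a -z 600,200 -x map-ont are the options quoted in the README;
    # -t only sets the number of threads.
    # samtools sort reads SAM from stdin ("-") and writes coordinate-sorted cram.
    minimap2 -a -z 600,200 -x map-ont -t "$threads" "$ref_fa" "$reads_fq" |
        samtools sort -@ "$threads" -O cram --reference "$ref_fa" -o "$out_cram" -

    # Write the .crai index next to the cram.
    samtools index "$out_cram"
}
```

For example, align_ultralong hs37d5.fa combined.fastq HG002_hs37d5.cram 16, where the fastq and output names are placeholders. The same reference fasta is needed later to decode the cram.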

We have also worked with Tobias Marschall and Jana Ebler to use whatshap haplotag to partition the nanopore reads by haplotype:

whatshap haplotag -o HG002_ONTrel2_16x_RG_HP10xtrioRTG.cram -r hs37d5.fa RTG.hg19.10x.trio-whatshap.vcf.gz HG002_ONTrel2_16x_RG.cram

For haplotype assignment, we used a vcf phased with a combination of 10x Genomics-based and trio-based phasing, located at:

ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_MPI_whatshap/RTG.hg19.10x.trio-whatshap.vcf.gz
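whatshap haplotag marks each read it can assign with an HP tag (1 or 2), so one way to pull out per-haplotype read sets from the haplotagged cram is to filter on that tag. The sketch below is an assumption-laden illustration, not part of the release: split_by_haplotype is a hypothetical helper, the output naming is made up, and the -d tag filter requires a reasonably recent samtools (1.12 or later, as far as we know). Reads that could not be assigned to a haplotype carry no HP tag and end up in neither output.

```shell
# Sketch only: split_by_haplotype is a hypothetical helper.
# Usage: split_by_haplotype IN.cram REF.fa OUT_PREFIX
split_by_haplotype() {
    in_cram=$1
    ref_fa=$2
    prefix=$3

    for hp in 1 2; do
        # -d HP:N keeps only alignments whose HP tag equals N;
        # -C writes cram, -T names the reference used to encode it.
        samtools view -C -T "$ref_fa" -d "HP:$hp" \
            -o "${prefix}.HP${hp}.cram" "$in_cram"
        samtools index "${prefix}.HP${hp}.cram"
    done
}
```

For example, split_by_haplotype HG002_ONTrel2_16x_RG_HP10xtrioRTG.cram hs37d5.fa HG002_ONTrel2 would write HG002_ONTrel2.HP1.cram and HG002_ONTrel2.HP2.cram.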

Sequencing contributors:

NIST

- David Catoe

- Marc Salit

- Noah Spies

- Nathan Olson

- Jenny McDaniel

- Justin Zook

Nottingham

- Matt Loose

Birmingham

- Nick Loman

- Josh Quick (extraction, libraries, sequencing)

- Andrew Beggs, Louise Tee, and Oliver J Pickles (cell culture)
