Question about Hi-C coverage for 3D genome assembly

Mj Hu

Apr 12, 2018, 3:53:47 PM

to 3D Genomics

Hi Aiden Lab members,

I am a little confused about Hi-C coverage for 3D genome assembly. In your Science paper, you used as little as 6.7x sequencing coverage of Hi-C data to help assemble the human genome. I assume this means you sequenced 6.7x the human genome size, i.e. ~21.44 Gb of Hi-C data. As far as I know, not all Hi-C sequencing data carries Hi-C contact information: for example, non-ligated DNA fragments can end up inserted between the sequencing adapters, and some reads are PCR duplicates. So if I am right, the truly informative Hi-C data is less than 6.7x, and the informative fraction may vary between batches of experiments.
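
For readers following the arithmetic, here is a minimal sketch of the calculation. The 3.2 Gb genome size is inferred from the figures in this post (21.44 Gb / 6.7), and the loss fractions in the usage note are purely hypothetical placeholders, not values from the paper. Treating the two losses as independent and multiplicative is also a simplifying assumption.

```python
def raw_bases(coverage, genome_size_bp):
    """Total sequenced bases implied by a given fold coverage."""
    return coverage * genome_size_bp


def informative_coverage(raw_coverage, duplicate_frac, unligated_frac):
    """Coverage left after discarding PCR duplicates and non-ligated fragments.

    Both fractions are hypothetical inputs; they are assumed to be
    independent, so the surviving fractions are multiplied together.
    """
    kept = (1.0 - duplicate_frac) * (1.0 - unligated_frac)
    return raw_coverage * kept


if __name__ == "__main__":
    # 6.7x of an assumed 3.2 Gb genome
    print(raw_bases(6.7, 3.2e9) / 1e9, "Gb raw")
    # Hypothetical 10% duplicates and 10% non-ligated fragments
    print(informative_coverage(6.7, 0.10, 0.10), "x informative")
```

With these made-up 10% losses, the informative coverage would drop from 6.7x to roughly 5.4x; real libraries should be measured rather than assumed.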

In short, I am confused about how much Hi-C data I need to help assemble a genome.

Thanks very much for the help!

Olga Dudchenko

Apr 20, 2018, 11:19:50 AM

to 3D Genomics

Hi Mj,

Your assessment of the coverage reported in the Science paper is correct: this is raw data, and indeed not all of it ends up among Hi-C contacts. Note that in the case of that particular human library, the losses associated with unalignable reads, intra-fragment reads and mapq0 reads are relatively low, and the Hi-C contact coverage is not very different from the raw one.

This may not hold from one experiment to another, particularly where mapq0 reads are concerned: that loss may depend on the repeat content of the genome you are working with and the technology used to assemble the draft. So one would have to account for that when planning deep sequencing. Furthermore, Hi-C assembly is about the distribution of reads with 1D distance, and there is no obvious way to assess that without at least a preliminary 1D assembly, unless you are performing reassembly.
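
As a rough planning aid, the accounting described above can be sketched as follows. This is only an illustration: the usable fraction (the share of raw data that survives as Hi-C contacts after unalignable, intra-fragment and mapq0 reads are dropped), the target coverage and the read length are all hypothetical inputs that would have to come from a pilot run or a comparable library.

```python
import math


def required_raw_coverage(target_contact_coverage, usable_fraction):
    """Raw fold coverage needed to end up with the target contact coverage."""
    if not 0.0 < usable_fraction <= 1.0:
        raise ValueError("usable_fraction must be in (0, 1]")
    return target_contact_coverage / usable_fraction


def read_pairs_needed(raw_coverage, genome_size_bp, read_length_bp):
    """Number of paired-end read pairs that yield the given raw coverage."""
    bases_per_pair = 2 * read_length_bp
    return math.ceil(raw_coverage * genome_size_bp / bases_per_pair)


if __name__ == "__main__":
    # Hypothetical: aim for 6x contact coverage when only 75% of raw data is usable
    raw = required_raw_coverage(6.0, 0.75)
    print(raw, "x raw coverage")
    # Assumed 3.2 Gb genome and 2 x 150 bp reads
    print(read_pairs_needed(raw, 3.2e9, 150), "read pairs")
```

The lower the usable fraction (for example in a repeat-rich genome with many mapq0 reads), the more raw sequencing is needed for the same contact coverage.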

The hope is that the Hs2-HiC case can serve as a guide to how much data from a good Hi-C library is needed to assemble a typical mammalian genome from short-read data and get about 90% of the sequence into chromosomes. For how to judge library quality without a chromosome-length reference, see the relevant discussion in the Genome Assembly Cookbook. It is perhaps worth noting that 6.7x is a conservative estimate that allows for assembly of very short pieces: if the draft is of very good quality, such that most of your contigs are >>15 kb, even lower coverage can be sufficient. More coverage, however, is not detrimental, and given that the cost of short-read Hi-C sequencing is often marginal compared to the DNA-Seq requirements, oversequencing to offset any suspected problems with the Hi-C library is not a terrible idea.

Hope this helps,
Olga
