Advice on Hi-C scaffolding

colin

Jan 20, 2021, 11:11:23 AM

to 3D Genomics

Hi!

First of all, thank you so much for the awesome software; it is extremely useful!

I am trying to scaffold my plant genome (~1.5Gb according to my draft assembly); the contig (draft) assembly was generated using Canu correction of PromethION long reads (N50 read length ~ 25kb; ~80x used for correction), assembled using SMARTdenovo, and polished (x2) using Illumina reads (~100x) with Pilon; this had a BUSCO completeness of ~70% using the embryophyta DB.

I used the Juicer and 3d-dna pipelines with default parameters, with a Hi-C library prepared by Phase Genomics, to scaffold my draft assembly, but the result seems to be more fragmented than the input (?). The scaffolded assembly has 13,237 scaffolds (with lower BUSCO completeness as well), whereas my draft contig assembly had 3,971 contigs.

I am wondering what could be the reason for this outcome? Is it possible that my draft assembly is bad to begin with? Or is there a process that is going wrong with the scaffolding? Do you have any suggestions on parameters to tweak, or any suggestions on improving upon this? I’ve attached the before (step 0 from 3d-dna) and after (rawchrom from 3d-dna) images of the contact maps below.

Also, here is the output from inter.txt (from Juicer):
"Sequenced Read Pairs: 542,109,179
Normal Paired: 273,911,937 (50.53%)
Chimeric Paired: 102,213,139 (18.85%)
Chimeric Ambiguous: 154,992,903 (28.59%)
Unmapped: 10,991,200 (2.03%)
Ligation Motif Present: 161,328,676 (29.76%)
Alignable (Normal+Chimeric Paired): 376,125,076 (69.38%)
WARN [2021-01-16T16:28:10,827] [Globals.java:138] [main] Development mode is enabled
Unique Reads: 320,371,239 (59.10%)
PCR Duplicates: 55,385,965 (10.22%)
Optical Duplicates: 367,872 (0.07%)
Library Complexity Estimate: 1,146,034,284
Intra-fragment Reads: 13,758,769 (2.54% / 4.29%)
Below MAPQ Threshold: 204,310,869 (37.69% / 63.77%)
Hi-C Contacts: 102,301,601 (18.87% / 31.93%)
Ligation Motif Present: 25,863,586 (4.77% / 8.07%)
3' Bias (Long Range): 62% - 38%
Pair Type %(L-I-O-R): 25% - 25% - 25% - 25%
Inter-chromosomal: 68,398,860 (12.62% / 21.35%)
Intra-chromosomal: 33,902,741 (6.25% / 10.58%)
Short Range (<20Kb): 27,481,129 (5.07% / 8.58%)
Long Range (>20Kb): 6,406,471 (1.18% / 2.00%)"

Any insights on this would help! Please let me know.

Thank you so much!
Colin

3d-dna_genome.rawchrom.hic.pdf

3d-dna_genome.0.pdf

Olga Dudchenko

Jan 20, 2021, 2:36:38 PM

to 3D Genomics

Hello Colin,

Your contact map does not look very good: you have very big coverage biases (load the coverage track on your .0.hic map: view -> show annotation panel -> show basic annotations). These may be due to the library (most likely) or to the assembly (I've seen things like this with Oxford Nanopore contigs); I can't say which just by looking at the map. You also have a high % of reads mapping with mapq 0. Make sure you don't have a whole lot of undercollapsed heterozygosity (closer to the telomeres, judging by the map you have shared).
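
The mapq share can be read straight off the inter.txt figures quoted in the first post. A minimal sketch using only those numbers; the variable names are illustrative, and it assumes the second percentage on each inter.txt line is taken relative to unique reads (which is what the arithmetic below reproduces) and that "Below MAPQ Threshold" is the line the mapq remark refers to:

```python
# Figures copied from the inter.txt output quoted in the first post.
sequenced_pairs = 542_109_179
unique_reads = 320_371_239
below_mapq = 204_310_869
hic_contacts = 102_301_601


def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


# Assumption: the second percentage in inter.txt is relative to unique reads.
below_mapq_of_unique = pct(below_mapq, unique_reads)
contacts_of_unique = pct(hic_contacts, unique_reads)
below_mapq_of_sequenced = pct(below_mapq, sequenced_pairs)

print(f"Below MAPQ threshold: {below_mapq_of_unique}% of unique reads")
print(f"Hi-C contacts:        {contacts_of_unique}% of unique reads")
```

In other words, roughly two thirds of the unique reads fall below the MAPQ threshold, and under a third end up as usable Hi-C contacts.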

I advise you to look at your .0.hic with KR normalization on. If you see all chromosomes and you feel that the signal is reasonable, go ahead and try to adjust parameters (e.g. editor-repeat-coverage), remove undercollapsed heterozygosity, etc. If not, consider revisiting the library prep.

Olga

colin

Jan 20, 2021, 3:44:31 PM

to 3D Genomics

Hi Olga,

Thanks so much for the prompt response, this is helpful! I understand that it is difficult to diagnose the problem just by looking at the map.

I have loaded the coverage track on my .0.hic map, and (if I am analyzing this correctly?) the coverage seems to be evenly balanced in the .0.hic map, whereas it is highly biased in the rawchrom.hic map. [I have attached the maps with the tracks on here]

I was able to observe all 26 (expected) chromosomes in the .0.hic with KR normalization, and the signal seems reasonable, so I will try adjusting the parameters.

Do you recommend increasing the editor-repeat-coverage to, say, ~5? And do you suggest tweaking other parameters?
Do you also have suggestions for removing undercollapsed heterozygosity? Would Purge Haplotigs help with this?

Thank you again!
Colin

0.hic_coveragetrack_balancednorm.pdf

rawchrom.hic_coveragetrack_balancednorm.pdf

Olga Dudchenko

Jan 20, 2021, 4:00:03 PM

to 3D Genomics

It's the same data, so you can't have a biased rawchrom if your 0 is not biased. Look at the shootouts (spikes) in the coverage track: they are very considerable and, as expected from your original map, associated with the centers of the chromosomes. Yes, Purge Haplotigs is designed for that, but check your telomeres for undercollapsed heterozygosity to make sure you need it. See the agwg-merge GitHub repository for more info on what to look for if unsure.

Look at those shootouts and try setting editor-repeat-coverage to a value higher than what you see there (at 25kb resolution) [if you mouse over the track, the text panel on the right will show you the corresponding numbers].

Olga

colin

Jan 22, 2021, 1:38:38 AM

to 3D Genomics

Hi Olga,

Thanks for your advice. I was able to generate an improved contact map by rerunning the 3d-dna pipeline with --editor-repeat-coverage 5 --splitter-coarse-stringency 30; the resulting .hic maps for 0 and rawchrom are attached. I can roughly make out the expected 26 chromosomes, but the contact map gets a bit weird after ~1.2Gb.

Perhaps this is due to a faulty contig assembly? A previous draft assembly (with fewer ONT reads used for Canu correction) came out at about 1.2Gb after SMARTdenovo assembly, and I am currently running the Juicer/3d-dna pipeline on that assembly to see if it scaffolds more smoothly. I would love to hear your thoughts!
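
For anyone wanting to reproduce a rerun like this, the call can be assembled as below. This is only a sketch: the two flags and their values are the ones quoted in this post, the script name and argument order follow the run-asm-pipeline.sh call shown elsewhere in this thread, and the file paths are hypothetical placeholders. It prints the command rather than running it.

```python
import shlex


def build_3d_dna_command(draft_fasta, merged_nodups, repeat_coverage, coarse_stringency):
    """Assemble a 3d-dna rerun command as an argument list (options first, then inputs)."""
    return [
        "./run-asm-pipeline.sh",
        "--editor-repeat-coverage", str(repeat_coverage),
        "--splitter-coarse-stringency", str(coarse_stringency),
        draft_fasta,
        merged_nodups,
    ]


# Hypothetical paths; substitute your own draft assembly and Juicer output.
cmd = build_3d_dna_command("draft.fasta", "aligned/merged_nodups.txt", 5, 30)
command_line = shlex.join(cmd)
print(command_line)
```

Keeping the arguments in a list makes it easy to try other values one at a time and to pass the same list to subprocess.run once the printed command looks right.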

Thank you so much for your help!

Colin

3d-dna_v3.0.hic.pdf

3d-dna_v3.rawchrom.hic.pdf

Olga Dudchenko

Jan 25, 2021, 9:57:41 AM

to 3D Genomics

Colin,

Now tweak your misjoin correction. Notice how the stuff removed is all low coverage? 3d-dna does misjoin correction by looking for local drops in coverage along the diagonal. Because your map is noisy, it has relatively little signal on the diagonal despite the fact that you have a lot of reads. Do misjoin correction at a lower (coarser) resolution, where you have more signal per bin (--editor-coarse-resolution). See the cookbook or the Dudchenko et al., 2017 supplement for how the misjoin detection track looks (depletion_score_wide.at.step.0.wig).
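
To see why a larger bin size helps on a sparse map: the number of contacts landing in each bin along the diagonal grows roughly in proportion to the bin size. A back-of-the-envelope sketch, assuming purely for illustration that short-range contacts are spread uniformly along the assembly (a map with strong coverage biases will be far from uniform). The contact count and assembly size come from the first post, 25 kb is the resolution mentioned earlier in the thread, and the other bin sizes are arbitrary examples.

```python
def mean_contacts_per_bin(total_contacts, genome_size_bp, bin_size_bp):
    """Mean contacts per diagonal bin if contacts were spread uniformly."""
    n_bins = genome_size_bp / bin_size_bp
    return total_contacts / n_bins


short_range_contacts = 27_481_129   # "Short Range (<20Kb)" from the first inter.txt
assembly_size_bp = 1_500_000_000    # ~1.5 Gb draft assembly

for bin_size in (25_000, 50_000, 100_000):
    mean = mean_contacts_per_bin(short_range_contacts, assembly_size_bp, bin_size)
    print(f"{bin_size:>7} bp bins: ~{mean:,.0f} contacts per bin on average")
```

The absolute numbers matter less than the scaling: quadrupling the bin size quadruples the expected signal per bin, at the cost of localizing candidate misjoins less precisely.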

Best,

Olga

colin

May 20, 2021, 7:09:07 PM

to 3D Genomics

Hi Olga and 3D-DNA team,

I'm revisiting this from January... I recently acquired high-quality PacBio HiFi CCS reads and built a new contig-level draft assembly (assembled using hifiasm in Hi-C mode), which worked better than the previous nanopore+illumina assembly. I used the same Hi-C sequencing data with Juicer on this new assembly, and used the default settings in 3d-dna to scaffold.

Here are some estimated characteristics about my genome:

Size: 2n = 1.85Gb (diploid); the current assembly is a haploid genome

Chromosome Count: 2n = 52 (diploid)

Here is the output from inter.txt (from Juicer):

Sequenced Read Pairs: 542,109,179
Normal Paired: 273,878,423 (50.52%)
Chimeric Paired: 113,699,827 (20.97%)
Chimeric Ambiguous: 142,711,672 (26.33%)
Unmapped: 11,819,257 (2.18%)
Ligation Motif Present: 161,328,676 (29.76%)
Alignable (Normal+Chimeric Paired): 387,578,250 (71.49%)
WARN [2021-05-19T10:13:34,181] [Globals.java:138] [main] Development mode is enabled
Unique Reads: 322,278,283 (59.45%)
PCR Duplicates: 64,860,844 (11.96%)
Optical Duplicates: 439,123 (0.08%)
Library Complexity Estimate: 1,022,358,395
Intra-fragment Reads: 14,719,719 (2.72% / 4.57%)
Below MAPQ Threshold: 158,503,171 (29.24% / 49.18%)
Hi-C Contacts: 149,055,393 (27.50% / 46.25%)
Ligation Motif Present: 39,622,702 (7.31% / 12.29%)
3' Bias (Long Range): 62% - 38%
Pair Type %(L-I-O-R): 25% - 25% - 25% - 25%
Inter-chromosomal: 90,961,595 (16.78% / 28.22%)
Intra-chromosomal: 58,093,798 (10.72% / 18.03%)
Short Range (<20Kb): 37,998,143 (7.01% / 11.79%)
Long Range (>20Kb): 20,075,680 (3.70% / 6.23%)

For some reason, after running the 3d-dna pipeline in default mode ('./run-asm-pipeline.sh ../hifiasm.asm.hic.p_ctg.fa ../juicer/aligned/merged_nodups.txt'), I seem to get two very different Hi-C maps between the "0" vs. "rawchrom"... Do you happen to know what might be going on here, and which parameters I can tune to get a better scaffolding result?

Any insights would be super useful!

Thank you again,

Colin

2021.05.20.19.05.53.HiCImage_rawchrom_coverage_balancednorm.pdf

2021.05.20.16.03.42.HiCImage_0_coverage_balancednorm.pdf

colin

May 24, 2021, 12:45:50 AM

to 3D Genomics

Hi Olga & 3D-DNA team,

After tuning some parameters (--editor-repeat-coverage 5 --splitter-coarse-stringency 30), I got a decent-looking contact map showing the 26 expected chromosomes in blue. I do have some low-coverage contacts towards the end, which is roughly consistent with where the map should fall off given my estimated genome size (~925Mb).

I was wondering if there are any other parameters I could potentially tune before manually assembling these scaffolds! I am attaching the .0.hic and .rawchrom.hic (w/superscaffold track manually edited).

Thanks again for this great tool and any insights would be helpful :-)

Colin

2021.05.24.00.26.39.HiCImage_rawchrom.pdf

2021.05.24.00.26.39.HiCImage_0.pdf

Olga Dudchenko

May 26, 2021, 12:44:23 AM

to 3D Genomics

Colin,

It's the same Hi-C data, so your issues are mostly the same as before: coverage biases and sparsity near the diagonal. Your N50 is somewhat better in this assembly, judging by the stats, and you have less undercollapsed heterozygosity (fewer reads with mapq=0). This helps, but the coverage biases and sparsity are what's causing the difference. You were right to apply the suggestions from before.
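
The change between the two assemblies can be checked against the two inter.txt outputs quoted in this thread. A small sketch using only those counts; the run labels are illustrative, and it assumes "Below MAPQ Threshold" is the line the mapq=0 remark refers to and that shares are taken relative to unique reads.

```python
# Counts copied from the two inter.txt outputs quoted in this thread.
runs = {
    "nanopore+illumina": {
        "unique": 320_371_239,
        "below_mapq": 204_310_869,
        "long_range": 6_406_471,
    },
    "hifiasm": {
        "unique": 322_278_283,
        "below_mapq": 158_503_171,
        "long_range": 20_075_680,
    },
}


def share_of_unique(run, key):
    """Percentage of unique reads in the given category, to two decimals."""
    return round(100.0 * run[key] / run["unique"], 2)


for name, run in runs.items():
    below = share_of_unique(run, "below_mapq")
    long_range = share_of_unique(run, "long_range")
    print(f"{name:>18}: below MAPQ {below}%, long-range intra {long_range}%")
```

The same library gives fewer low-MAPQ reads and more long-range contacts on the new assembly, while the total number of unique reads is essentially unchanged.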

Olga

Olga Dudchenko

May 26, 2021, 12:47:20 AM

to 3D Genomics

Hey Colin,

Your low coverage contacts are either 1) contaminants or 2) near-perfect repeats. You can mouse over to see the sequence ids and manually examine them in the fasta.
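
Pulling a sequence out of the fasta by its ID can be done with a few lines of standard-library Python. A minimal sketch: the function names and the demo file are made up for illustration, and the GC fraction is only a rough first look before, for example, searching the sequence against a database to check for contamination.

```python
import os
import tempfile


def read_fasta_record(path, wanted_id):
    """Return the sequence whose header ID (first word after '>') matches, else None."""
    found = False
    seq_parts = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if found:
                    break  # reached the record after the one we wanted
                fields = line[1:].split()
                found = bool(fields) and fields[0] == wanted_id
            elif found:
                seq_parts.append(line)
    return "".join(seq_parts) if found else None


def gc_fraction(seq):
    """Fraction of G/C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


# Demo on a tiny made-up fasta; replace with your assembly and a real sequence ID.
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
    tmp.write(">ctg1 example\nACGTACGT\nGGCC\n>ctg2\nATATAT\n")
    demo_path = tmp.name

seq = read_fasta_record(demo_path, "ctg1")
print(len(seq), round(gc_fraction(seq), 3))
os.remove(demo_path)
```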

I encourage you to follow the suggestion from before re misjoin detection if you want to improve this further.