Optimizing scaffolding of a genome with tetrasomic inheritance

Nate Backenstose

Jun 1, 2021, 10:41:01 PM

to 3D Genomics

Hello 3D genomics group,

It's my first time scaffolding a genome. I have gone through some of the conversations here, but I would like to reach out for some advice. The estimated genome size is 2.6 Gb with 40 chromosomes, and the genome may also have some residual tetrasomically inherited regions that could encompass up to 10% of the genome.

We sequenced with Oxford Nanopore, assembled with Flye, and then polished with Illumina reads using Pilon. Haplotigs were purged and the genome was integrated with a linkage map using Chromonomer. We then used the Juicer and 3D-DNA pipelines, running the latter with multiple sets of parameters. After reading some of the conversations here, I gather that I do not have a lot of data, as seen in the Juicer stats below (added in a second message), and that this might be contributing to the scaffolding errors.

The best results I have been getting are with the stringency values for the editor, polisher, and splitter reduced to 10. I have also tried running with an early exit using the "-r 0" flag, and changing the resolutions from 100kb down to 25kb in 25kb increments. In the "*.0.hic" plot I am able to count around 40 chromosomes, but there is some overlap between them. In the ".final.hic" plot I am seeing 36 chromosomes, but overall they are not close to the estimated genome size and I am getting lots of debris in the final map. Any advice you might have would be greatly appreciated, and please let me know if I can provide any additional information.

Thank you for supporting this software with this community. I am learning a lot.

Nate

Nate Backenstose

Jun 1, 2021, 10:41:25 PM

to 3D Genomics

___________

Sequenced Read Pairs: 22,500,000

Normal Paired: 7,670,178 (34.09%)

Chimeric Paired: 7,366,570 (32.74%)

Chimeric Ambiguous: 6,821,925 (30.32%)

Unmapped: 641,327 (2.85%)

Ligation Motif Present: 7,825 (0.03%)

Alignable (Normal+Chimeric Paired): 15,036,748 (66.83%)

Intra-fragment Reads: 374,714 (1.67% / 2.62%)

Below MAPQ Threshold: 6,716,042 (29.85% / 47.02%)

Hi-C Contacts: 7,193,360 (31.97% / 50.36%)

Ligation Motif Present: 4,343,550 (19.30% / 30.41%)

3' Bias (Long Range): 73% - 27%

Pair Type %(L-I-O-R): 25% - 25% - 25% - 25%

Inter-chromosomal: 2,856,492 (12.70% / 20.00%)

Intra-chromosomal: 4,336,868 (19.27% / 30.36%)

Short Range (<20Kb): 2,302,164 (10.23% / 16.12%)

Long Range (>20Kb): 2,034,619 (9.04% / 14.24%)

---------------------

/run-asm-pipeline.sh --editor-coarse-stringency 10 --polisher-coarse-stringency 10 --splitter-coarse-stringency 10 $references/CA_GEN_1-flye_III-pilon2-purged.chromonomer.fasta $aligned/merged_nodups.txt

----------------------

Nate Backenstose

Jun 1, 2021, 10:45:46 PM

to 3D Genomics

Images of the maps (sorry, I had to split all of this across several posts):

.0.hic.JPG

.final.hic.JPG

Olga Dudchenko

Jun 22, 2021, 11:27:31 AM

to 3D Genomics

Hi Nate,

Great job exploring the issues of parameter adjustment on the forum! Looking at your data I can see the following. You indeed have very little data for such a large genome. You are trying to scaffold with just 2.8M reads! For a genome of your size you'd probably want something like 100M+ at least, especially given that you lose a lot to alignment, probably due to highly syntenic regions. I would highly recommend sequencing deeper even though your draft is relatively contiguous (you don't list the N50, but it seems OK-ish given the stats).

You probably want to zoom in on those overlapping regions to understand what they are. Are they heterochromatic repeats (like between chr 7 and 8 on .0.hic)? Collapsed repeats (like between #5 and #6)?

Judging by your tracks, you might want to increase the editor repeat coverage threshold (everything's off due to those problematic regions, so the default threshold of 2x at 25kb is not working: you can see that way too much of your genome is flagged as problematic in the repeats_wide track). Your depletion score is also not working, again as seen from the track. When you have so little data you'll have to settle for doing all of your editing at a lower resolution, so try not tinkering with stringency (it's not going to help, since your bins are practically empty due to the sparsity of your data), but rather set your editor resolution to something closer to 100kb.
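
To put a rough number on the sparsity point, here is a back-of-the-envelope sketch in Python. It is not part of Juicer or 3D-DNA. The genome size and contact count are the figures posted in this thread; the helper names `contact_ends_per_bin` and `pixel_gain` are made up for illustration, and the calculation assumes contacts are spread uniformly over the genome, which real Hi-C data is not.

```python
# Crude sparsity estimate; inputs are the figures quoted in this thread.
GENOME_SIZE_BP = 2_600_000_000   # estimated genome size (2.6 Gb)
HIC_CONTACTS = 7_193_360         # "Hi-C Contacts" line of the Juicer stats


def contact_ends_per_bin(contacts: int, genome_size_bp: int, resolution_bp: int) -> float:
    """Mean number of contact ends falling into one bin of the given size.

    Each contact has two ends. Assumes uniform coverage, which is a
    simplification (hypothetical helper, not a 3D-DNA function).
    """
    n_bins = genome_size_bp / resolution_bp
    return 2 * contacts / n_bins


def pixel_gain(fine_bp: int, coarse_bp: int) -> float:
    """Factor by which the mean count per matrix pixel grows when coarsening.

    A pixel covers resolution x resolution, so the gain is the squared ratio.
    """
    return (coarse_bp / fine_bp) ** 2


for resolution in (25_000, 50_000, 100_000):
    per_bin = contact_ends_per_bin(HIC_CONTACTS, GENOME_SIZE_BP, resolution)
    gain = pixel_gain(25_000, resolution)
    print(f"{resolution // 1000:>4} kb: ~{per_bin:.0f} contact ends per bin, "
          f"{gain:.0f}x counts per pixel relative to 25 kb")
```

Under these assumptions, coarsening from 25kb to 100kb raises the mean count per bin 4-fold and the mean count per contact-matrix pixel 16-fold, which is why a lower editing resolution can partly compensate for shallow sequencing.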