Fragmented scaffolds - large and repetitive genome

sighex

Oct 11, 2022, 11:47:04 PM
to 3D Genomics
Hi. Thanks for guiding me to this forum. We obtained a 3D-DNA output using 30x HiFi data and in-house Hi-C data for a large, highly repetitive genome of about 5 Gb. The Juicebox contact map looks like this.
   
195223140-a5a8ee97-909e-486e-a34b-a7d5c6ad1c9d.png
We see 1) a huge fraction for which we could not retrieve chromatin contacts and 2) a number of signals far from the diagonal, that is, fragments that failed to be resolved into chromosomal scaffolds. I expect that at least fraction 2) can be better resolved into chromosomal scaffolds by tweaking parameters. Any recommendations?

I would really appreciate your response.

Best regards,

Shige

Olga Dudchenko

Oct 14, 2022, 5:55:17 PM
to 3D Genomics
Hi Shige,

It is hard to judge from the figure what the source of the problem is, but here is how I would investigate it. There are usually two reasons why too much stuff is "thrown out" by the edit step. One is that the data is too sparse near the diagonal, but I don't think that's the case here: looking at the figure, it looks dense and clean. The second is that the sequence appears to have overly high coverage compared to the rest of the genome, due to peculiarities of the library prep, the genome itself, a large presence of repeats, undercollapsed heterozygosity, GC content, etc. If the latter is the case, then when you load the coverage track (view->show annotations->1D annotations->basic annotations) you'll see that the thrown-out bit after the main assembly has elevated coverage. If you are certain this looks genuine and there are no other problems, all you need to do is add something like --editor-repeat-coverage 5 to the run to raise the threshold for flagging sequences that receive suspiciously high numbers of reads compared to the average across the genome.
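
To make the effect of the threshold concrete, here is a toy sketch in Python. It is not 3D-DNA's actual implementation. It only illustrates the general idea of flagging bins whose coverage exceeds some multiple of the genome-wide average. The function name `flag_high_coverage` and the coverage values are made up for illustration.

```python
from statistics import mean


def flag_high_coverage(bin_coverage, threshold):
    """Return indices of bins whose coverage exceeds threshold x mean coverage.

    Toy illustration only; the real editor step in 3D-DNA may compute and
    normalize coverage differently.
    """
    avg = mean(bin_coverage)
    return [i for i, c in enumerate(bin_coverage) if c > threshold * avg]


# Hypothetical per-bin read counts: mostly ~100, with a few elevated bins
# at the end (standing in for repeat-rich or undercollapsed sequence).
coverage = [100] * 96 + [300, 600, 900, 3000]

# Raising the threshold flags fewer bins as "suspiciously high coverage".
for threshold in (2, 5, 15):
    flagged = flag_high_coverage(coverage, threshold)
    print(f"threshold={threshold}: {len(flagged)} bins flagged -> {flagged}")
```

In this toy example a higher threshold lets more of the elevated-coverage bins through, which is the direction --editor-repeat-coverage is meant to push. Bins that stay flagged even at a high threshold are the ones to inspect more closely.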

Best,
Olga

Shigehiro Kuraku

Oct 15, 2022, 1:53:12 AM
to 3d-ge...@googlegroups.com
Thank you very much, Olga. As you suggested, I displayed the coverage track and found clear trends. I actually got this scaffolding result with the option --editor-repeat-coverage 15. In this case, what else would you recommend if we stick with this Hi-C library? Could it be that the thrown-out regions were extremely inaccessible, which resulted in low Hi-C data coverage?

Shige

image.png

Olga Dudchenko

Oct 27, 2022, 12:49:15 AM
to 3D Genomics
I would examine the thrown-out stuff a bit more closely. Can you zoom in and show me what you are seeing in the discarded portion? They don't look like spread repeats to me; they seem to have a clear preference for a place in the genome. Are those duplicated bits representing undercollapsed heterozygosity or something? You can incorporate them, but I'd advise understanding what they are first.

-Olga