JuiceBox Manual Edits and X Chromosome


Thomas Malachowski

Oct 22, 2021, 11:36:48 AM
to 3D Genomics
Hi All,

Thank you so much for developing these amazing tools for HiC analyses and genome assembly. I have some questions about the 3D-DNA parameters and specifically about how to properly do manual corrections with JuiceBox. I had never worked with HiC data before attempting to assemble a genome de novo, so I am unsure how to interpret the interaction map.

The data I have is paired-end 2x150 bp Illumina reads. I am reconstructing a human genome and I have 24X theoretical genomic coverage. I adapter trimmed and lightly quality trimmed the ends of the reads (not the middle; no sliding-window trimming). Reads shorter than 60bp were discarded since w2rap-contigger builds a 60-mer graph first before expanding to a larger de Bruijn graph. I used a kmer size of 136 for my reads (instead of the default of 200, which is larger than my read size). This produced an assembly with an N50 of 72kbp.
I used Juicer with the Arima and Early Exit options. The HiC reads were also adapter trimmed and lightly quality trimmed only at the ends. The theoretical coverage is 14X.
3D-DNA was run with default settings and manually corrected.

I loaded the .0.hic file and the appropriate tracks. From my understanding, there does not seem to be too much coverage bias (generally everything stays around 1 - although there are some regions of no coverage). The depletion saturation does not look terrible and neither do the repeat tracks, I think.

1) For this I was wondering if any parameters should be edited to make a better assembly since the final estimated genome size appears to only be 2.35Gbp and not closer to the 2.8Gbp I was expecting (based on the effective genome size). I figured the repeat editor should not be changed based on other discussions on this forum. Should the saturation or resolution of the coarse or fine editor be changed? 

For the manual assembly, I thought it overall looked good besides having lower estimated genome size than expected but I am unsure if I made the correct manual edits. I attached the rawchrom.hic (before manual edits) and the final.hic (after manual edits). 

2) Did I interpret the data correctly, or did I overcorrect it and fuse chromosomes that are actually separate?

3) Is it possible to produce the inter.txt file by using an individual Juicer command since it is not produced if run with early exit?

4) My last question is how I would go about creating an assembly that has a better super scaffold corresponding to the X and Y chromosomes. I am particularly interested in the X chromosome, but my assembly cannot seem to assemble a proper X chromosome unless I use PafScaff (which I believe was used in some of your other papers recently). I would ideally like to avoid this since it would involve mapping the smaller scaffolds to the X chromosome. My current method is to locate the super scaffolds with Mashmap (considering only scaffolds over 1 Mb) and then use seqtk to extract those fasta sequences, which generally gives me about 23-28 chromosome-scale scaffolds.
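The length filter in that last step can be sketched in plain Python. This is only an illustration, not the Mashmap or seqtk commands actually used: the function names `read_fasta` and `long_scaffolds` are hypothetical, and the toy records and the tiny threshold in the demo are made up so the example runs without any input file.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the record name.
            name = (line[1:].split() or [""])[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def long_scaffolds(lines: Iterable[str], min_len: int = 1_000_000) -> Dict[str, int]:
    """Return {name: length} for scaffolds at least min_len bases long."""
    return {n: len(s) for n, s in read_fasta(lines) if len(s) >= min_len}


# Toy demo (invented records; a real run would pass an open FASTA file
# and keep the default 1 Mb threshold).
toy = [">HiC_scaffold_1 example", "ACGTAC", "GT", ">HiC_scaffold_2", "ACG"]
print(long_scaffolds(toy, min_len=5))
```

The returned names could then be written to a list file and handed to whatever extraction tool is preferred.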

I also attached a circos plot of my contigs mapped to Hg38 to show my concerns about X chromosome coverage and my overcorrections.

Links (I am unable to upload, please see google drive links):
Sample.0.hic
Sample.rawchrom.hic
Sample.final.hic
CircosMapping

Any information you have would be really appreciated! Thanks so much for your help! 

Best Wishes,

Olga Dudchenko

Nov 5, 2021, 12:37:46 PM
to 3D Genomics
Hey Thomas,

Thanks for the feedback and thanks for the detailed email, this was a pleasure to read! Sorry for the delay in answering, I'm on leave. To answer your questions:

1) The run looks fine to me, I agree with your diagnosis. I am guessing that when you refer to 2.35 you are talking only about the anchored part of the assembly? Note that by default 3d-dna will only try to scaffold sequences longer than 15kb in the draft (-i parameter). So, if you are running with default, you expect to anchor roughly everything that's >=15kb. That's probably what's primarily driving the length of the anchored sequences. If you see a large discrepancy between the two, that would be a reason to investigate. Often this is due to overzealous annotation of repeats, but I do not think this is the case judging by the .0. tracks (although I cannot say for sure from the static image and I don't see the crucial part at the end anyway). You can set -i lower, but your coverage is fairly low and you will lose noticeably on accuracy of the anchored sequences. If anchoring is more important to you than local accuracy though, that's certainly an avenue to explore.

2) My guess is that .final.hic is the result of your manual correction. I can see some errors that you've introduced. E.g. 9th HiC_scaffold seems to be a fusion maybe? Again, hard to judge from static images. HiC_scaffold_18 appears to have some misjoins maybe. Just in case you have not seen the tutorial and the tetris video, take a look (dnazoo.org/methods), perhaps it will help build intuition. 

3) Newer versions of Juicer will produce a stats file. Note however that it will not be helpful in the classical way since the stats will be calculated with respect to the draft. See some discussion of this in the Hi-C library prep paragraph of the Genome Assembly Cookbook. If you want stats with respect to the final assembly you should just rerun Juicer with respect to the chromosome-length fasta.

4) Nope, I don't know what PafScaff is. Hard to give recommendations since I don't know what the problem is with the current assembly of the X chromosome. Can you share some screenshots or explain in a bit more detail? Are you doing males and PAR is causing your problems? If I interpret your circos plots correctly, my guess is that because you are trying to do males your X coverage is low and maybe it all gets removed due to poor signal? Try to search for X in the .0.hic scaffolds and see what the tracks look like, and whether it all ends up being "suspect". You can then tweak to address. You should be able to find X on .0.hic by loading the coverage track: in male samples X will have lower coverage than the rest of the chromosome clusters.
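The coverage check suggested in 4) could be roughed out as below. This is a sketch under stated assumptions, not part of Juicebox or 3D-DNA: it assumes a per-scaffold mean of the normalized coverage track has already been exported into a dictionary, and the function name `half_coverage_candidates` and the 0.35-0.65 window are hypothetical choices for illustration.

```python
from statistics import median
from typing import Dict, List


def half_coverage_candidates(
    mean_cov: Dict[str, float], low: float = 0.35, high: float = 0.65
) -> List[str]:
    """Return scaffolds whose coverage is roughly half the median.

    In a male sample, X- and Y-derived sequence outside the PAR is
    present in one copy, so it is expected near 0.5x the typical value.
    The [low, high] window is an arbitrary, hypothetical tolerance.
    """
    if not mean_cov:
        return []
    ref = median(mean_cov.values())
    if ref <= 0:
        return []
    return sorted(n for n, c in mean_cov.items() if low <= c / ref <= high)


# Invented example values, only to show the call.
toy_cov = {"scf_a": 1.00, "scf_b": 1.02, "scf_c": 0.98, "scf_d": 0.50, "scf_e": 0.05}
print(half_coverage_candidates(toy_cov))
```

Scaffolds flagged this way are only candidates; collapsed or poorly mappable regions can also show reduced coverage, so they would still need to be inspected on the map.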

Best,
Olga

Thomas Malachowski

Dec 14, 2021, 10:53:03 AM
to 3D Genomics
Hi Olga,

Sorry for the long gap in response - things have gotten quite hectic as the end of the year approaches. Thanks for all the suggestions and explanations; they were very helpful!

1) After checking my assembly stats (specifically the N#), there does not seem to be any discrepancy: across my 5 unique assemblies, the N# at 15kbp is usually between N80 and N87, which corresponds to 2.3-2.6 Gbp. I will try "-i 10000" as well.

2) Thank you for reminding me of the video! I watched it originally when I first started this project a while ago, but with the additional knowledge I have now, it was much more meaningful and did help build some intuition. 

3) Thank you for the suggestion! I will attempt to re-run Juicer using the new assembly to produce relevant stats I can use for comparison.

4) I believe my issue is likely due to coverage in the WGS data as I am working with male samples. I have 5 assemblies ranging from 19-37X in PCR-Free WGS coverage. The assemblies that had > 30X coverage were able to assemble an appreciable portion of the X and Y chromosomes (I believe I did notice some PAR issues based on the HiC data) while the assemblies with < 30X could not assemble either chromosome. I am not exactly sure how to interpret the .0.hic for low coverage, so I have attached an image of an assembly that did assemble the X chromosome and one that could not. It would be greatly appreciated if you could help me interpret the X chromosome coverage. From my understanding, it looks like it may be in the debris because the contigs generated by w2rap-contigger were below 15kbp, so setting the -i parameter to 10kbp may help? It also looks like there may still be some chrX in the debris for the sample that could assemble some of chrX?
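As a back-of-the-envelope check on the coverage argument in 4): outside the PAR, X and Y are single-copy in a male sample, so their expected depth is about half the diploid figure. The helper name `expected_sex_chrom_coverage` is hypothetical, and the calculation treats the quoted genome-wide coverage as the diploid depth, which is only approximately true.

```python
def expected_sex_chrom_coverage(genome_cov: float, copies: int = 1, ploidy: int = 2) -> float:
    """Approximate depth on a chromosome present in `copies` copies.

    Assumes `genome_cov` is the depth seen on diploid (autosomal)
    sequence; PAR regions and mappability effects are ignored.
    """
    return genome_cov * copies / ploidy


# The WGS coverages mentioned in this thread: the 19-37X range and the 30X cutoff.
for cov in (19, 30, 37):
    print(f"{cov}X genome-wide -> ~{expected_sex_chrom_coverage(cov)}X on X or Y")
```

Under this simplification, the 30X cutoff noted above corresponds to roughly 15X on the sex chromosomes.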

(Attached image: Sample Without ChrX)

Thank you so much for all your help with this!

Best Wishes,
Thomas Malachowski

Olga Dudchenko

Jan 7, 2022, 11:13:20 AM
to 3D Genomics
Hi Thomas,

I am not sure what one can say from these static images, but you certainly have more debris in those without chrX, as you call them. It would make sense that much of those chromosomes would remain unassembled with 15x coverage, but I would imagine at least parts of X should make it. I'd be surprised if they were completely absent, with not a single contig, even from PAR, making it above 15kb. In practice dropping the threshold to 10kb should help a bit, but the nature of the distribution is such that this will additionally anchor only a very small amount of sequence.
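How much extra sequence a lower size threshold would make eligible can be estimated directly from the draft contig lengths before rerunning anything. The sketch below is an assumption-laden illustration, not 3D-DNA code: `eligible_bp` and `percent_eligible` are hypothetical helper names, and the lengths are invented toy values.

```python
from typing import Sequence


def eligible_bp(lengths: Sequence[int], min_len: int) -> int:
    """Total bases in contigs at least min_len long."""
    return sum(n for n in lengths if n >= min_len)


def percent_eligible(lengths: Sequence[int], min_len: int) -> float:
    """Percent of the assembly held in contigs at least min_len long
    (the 'N#' at that length, in the sense used earlier in the thread)."""
    total = sum(lengths)
    return 100.0 * eligible_bp(lengths, min_len) / total if total else 0.0


# Invented contig lengths; real values would come from the draft fasta.
toy_lengths = [500_000, 300_000, 150_000, 40_000, 12_000, 9_000, 4_000]
at_15k = eligible_bp(toy_lengths, 15_000)
at_10k = eligible_bp(toy_lengths, 10_000)
print(f">=15kb: {at_15k} bp, >=10kb: {at_10k} bp, gain: {at_10k - at_15k} bp")
```

Running the same comparison on the real contig length distribution would show whether the gain from a 10kb threshold is worth the possible loss in accuracy.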

Best,
Olga
