3D-DNA Scaffolding: Our assembly is more fragmented than the input


aur...@vt.edu

Aug 24, 2018, 12:41:25 PM
to 3D Genomics
Dear 3D Genomic Group, 

  I am trying to scaffold our assembly (genome size ~900 Mb; assembly size 870 Mb, with 3,204 sequences and an N50 of 1.56 Mb / L50 of 136 sequences). We did Hi-C library prep and sequencing with Phase Genomics, getting >60X of paired-end reads, of which >45X mapped to the genome. The merged_nodups.txt file has 48M mapped read pairs. We ran Juicer and 3d-dna with the default parameters, and the output was more fragmented than the input (13,127 sequences, with an N50 of 0.14 Mb / L50 of 1,102 sequences). 

  In the first Hi-C heatmap (myassembly.0.hic) I can see as many clusters as the chromosomes I am expecting, but in the subsequent rounds these clusters get smaller and smaller and the sequences are moved to an unclustered zone.

  The Juicer output was as follows:
  • Library Complexity Estimate: 52,709,417
  • Intra-fragment Reads: 2,597,203 (1.37% / 5.42%)
  • Below MAPQ Threshold: 33,028,439 (17.47% / 68.88%)
  • Hi-C Contacts: 12,321,814 (6.52% / 25.70%)
  •  Ligation Motif Present: 3,593,834  (1.90% / 7.50%)
  •  3' Bias (Long Range): 64% - 36%
  •  Pair Type %(L-I-O-R): 25% - 25% - 25% - 25%
  • Inter-chromosomal: 638,734  (0.34% / 1.33%)
  • Intra-chromosomal: 11,683,080  (6.18% / 24.37%)
  • Short Range (<20Kb): 7,293,881  (3.86% / 15.21%)
  • Long Range (>20Kb): 4,388,919  (2.32% / 9.15%)
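As a sanity check on the stats above (a sketch, assuming the usual Juicer convention that the two percentages per line are fractions of sequenced and of alignable read pairs, respectively), either percentage can be inverted to recover the totals; the alignable total indeed lands near the 48M pairs in merged_nodups.txt:

```python
# Back-compute totals from one Juicer stats line; counts and
# percentages are copied from the post above.
hic_contacts = 12_321_814
pct_of_sequenced = 0.0652   # 6.52 %
pct_of_alignable = 0.2570   # 25.70 %

total_sequenced = hic_contacts / pct_of_sequenced
total_alignable = hic_contacts / pct_of_alignable

print(f"~{total_sequenced / 1e6:.0f}M sequenced pairs")   # ~189M
print(f"~{total_alignable / 1e6:.0f}M alignable pairs")   # ~48M
```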
  So we thought that, despite the high number of PCR duplicates, we still had enough reads for the scaffolding. We also checked the coverage: ~22% of the genome is not covered by any mapped read, and 44% of the genome is covered by 5 or more reads. 

  Reading the manual, I got the impression that a lot of contigs have been annotated as debris, but I do not know:
  1. Which files should I look at?
  2. Which stats should I check in those files?
  3. Which parameters should I change to improve the assembly?
  Thank you.

  Aureliano

Olga Dudchenko

Aug 24, 2018, 1:27:08 PM
to 3D Genomics
Hi Aureliano,

This means that the default misjoin correction is not working for you. The most typical scenarios are: 1) coverage biases due to library preparation, or sample/draft peculiarities such as undercollapsed heterozygosity or the presence of large gaps; 2) insufficient coverage; 3) library peculiarities that result in the signal near the diagonal being of lower than expected density, e.g. if HindIII was used.

You can analyze what is going on by looking at the 1D track files for round 0; the most informative ones are typically depletion_score_wide.at.step.0.wig, mismatch_wide.at.step.0.bed, and repeats_wide.at.step.0.bed. See manual Figure 5 (page 28) for how they are expected to look.
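One quick way to eyeball whether the depletion track is saturated is to check what fraction of bins sit at the track's cap value. A minimal sketch, assuming a fixedStep wig layout (check your actual track header; the toy data here stands in for the real file):

```python
from io import StringIO

# Toy stand-in for depletion_score_wide.at.step.0.wig; the fixedStep
# header layout is an assumption -- verify against your actual file.
toy_wig = StringIO("""\
fixedStep chrom=assembly start=1 step=25000 span=25000
100
100
37
100
""")

# Keep only numeric data lines, skipping declaration lines.
values = [float(line) for line in toy_wig if not line[0].isalpha()]
cap = max(values)
frac_at_cap = sum(v == cap for v in values) / len(values)
print(f"{frac_at_cap:.0%} of bins at the cap value {cap:g}")
```

A saturated track should sit at the cap almost everywhere except near putative misjoins; a low fraction suggests the detector is starved for signal.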

From the numbers you are showing, I would guess that you may have too few reads, which leads to the signal for the editor not being saturated: you lose a lot to PCR duplicates, judging by the very low complexity estimate, and a lot to mapping quality (the indicator for an under-saturated misjoin detector would be a weird-looking depletion score wig track, see above). You can probably work with what you have, especially if you have large pieces in the draft, but you will need to modify the --editor parameters (e.g. --editor-saturation-centile). It also seems very concerning that 22% of the genome is not covered by any reads - this almost certainly means that the coverage analysis is off. Generally, you can tweak it by modifying --editor-repeat-coverage. It is hard to judge what this means without knowing what genome you are dealing with, but unless you know where this is coming from, my suggestion would be, among other things, to check whether these are alternative haplotypes. In that case you may need to go diploid rather than just tweak the editor.
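For reference, a sketch of how the editor flags mentioned above might be passed on a re-run. The numeric values are placeholders to illustrate the syntax, not tuned recommendations, and the run-asm-pipeline.sh path and input file names depend on your install and data:

```python
import shlex

# Hypothetical 3D-DNA invocation with adjusted editor parameters; the
# numeric values below are placeholders, not recommendations.
cmd = [
    "./3d-dna/run-asm-pipeline.sh",
    "--editor-saturation-centile", "10",
    "--editor-repeat-coverage", "4",
    "draft.fasta",         # your contig-level assembly
    "merged_nodups.txt",   # Juicer output
]
print(shlex.join(cmd))
```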

Good luck,
Olga

aur...@vt.edu

Aug 24, 2018, 2:46:39 PM
to 3D Genomics
Dear Olga, 

   Thank you very much for the detailed answer. We are working with a plant genome. We know that the genome has low heterozygosity (0.16%). We do not have any gaps; it is a PacBio assembly. So I think that the problem may be related to the library preparation. In my communication with Phase Genomics, they told me that "Our protocol is specifically designed to reduce the amount of split reads, which increases the amount of useful data but can mess with assumptions other tools make about the library". Could this be the cause?

  I would also like to ask how the program deals with PCR duplicates. Does it keep 2-3 copies or a single unique copy? Is it based on position only, or position plus variants?

  Thank you again, 

  Best regards

  Aureliano

Olga Dudchenko

Aug 24, 2018, 5:03:08 PM
to 3D Genomics
Aureliano,

You mean how Juicer filters them out? It aligns both ends independently, and if the alignments on both sides, from two different read pairs, point to the same positions, with a bit of wiggle room, the pairs are assumed to be PCR duplicates. Only one copy is left in the mnd file. Variants are not taken into account. 3D-DNA does not do any additional handling of PCR duplicates.
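The position-based marking described above can be sketched in a few lines. This is a toy illustration of the idea, not Juicer's actual code, and the wiggle-room size here is an arbitrary choice:

```python
WIGGLE = 4  # bp tolerance; the real tolerance used by Juicer may differ

def dedup(pairs):
    """Keep one copy of each set of pairs whose both ends map within
    WIGGLE bp of an already-kept pair; read-sequence variants are ignored."""
    kept = []
    for chr1, p1, chr2, p2 in sorted(pairs):
        for kc1, kp1, kc2, kp2 in kept:
            if (chr1, chr2) == (kc1, kc2) \
               and abs(p1 - kp1) <= WIGGLE and abs(p2 - kp2) <= WIGGLE:
                break  # duplicate of an already-kept pair
        else:
            kept.append((chr1, p1, chr2, p2))
    return kept

pairs = [("chr1", 100, "chr2", 500),
         ("chr1", 102, "chr2", 498),   # both ends within wiggle room: duplicate
         ("chr1", 100, "chr2", 900)]
print(len(dedup(pairs)))  # prints 2
```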

It would require some screenshots or data to draw any inference re library prep/sample peculiarities, but the numbers you list seem a bit baffling. In any case, look carefully at the round 0 maps in JBAT and those tracks; they may help with figuring out what's going on.

Best,
Olga

aur...@vt.edu

Aug 24, 2018, 6:47:52 PM
to 3D Genomics
Dear Olga, 

  Thank you for your response. I have looked at the heatmap with the tracks that you mentioned, but I do not see anything that pops out except a big repeat. What values should I expect? I have attached several screenshots at several zoom levels. 

  Thank you very much, 

  Best regards

  Aureliano
mydata0.zoom4.HiCImage.pdf
mydata0.zoom1.HiCImage.pdf
mydata0.all.HiCImage.pdf

Olga Dudchenko

Aug 27, 2018, 11:23:28 AM
to 3D Genomics
Aureliano,

You have very little data. The depletion score track is not saturated: see your dark-blue track and compare it to the one in Fig. 5. Try playing around with --editor-saturation-centile, though I am not even sure you can saturate it at the default resolution of 25 kb. Perhaps you will also have to change --editor-coarse-resolution to some larger number (e.g. 50000). Your goal is to make the track saturated everywhere except at the misjoins along the diagonal. You do not have to run the full pipeline for this, just run 3d-dna/edit/run-mismatch-detector.sh.
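The intuition behind coarsening the resolution can be written out as back-of-envelope arithmetic: coarser bins collect more contacts each, so the depletion score is easier to saturate. This sketch uses the assembly size and contact count from earlier in the thread and spreads contacts uniformly; real Hi-C signal is concentrated near the diagonal, so per-bin counts there will be higher than this average:

```python
contacts = 12_321_814     # Hi-C contacts from the Juicer stats above
assembly = 870_000_000    # draft assembly size in bp

for res in (25_000, 50_000, 100_000):
    bins = assembly // res
    print(f"{res // 1000:>4} kb bins: ~{contacts / bins:.0f} contacts/bin on average")
```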

Best,
Olga

Aureliano Bombarely

Aug 27, 2018, 12:16:11 PM
to 3D Genomics
Dear Olga, 

  Thank you very much for your answer. I'll try the parameters that you suggested and see how it goes. I already changed --editor-coarse-resolution, with somewhat better results. I'll also talk with the company that made the libraries. Some aspects of this library are concerning (such as the high percentage of PCR duplicates) compared with other Hi-C data I have worked with before without problems.

  Best regards

  Aureliano

Aureliano Bombarely

Aug 30, 2018, 7:33:16 AM
to 3D Genomics
Dear Olga, 

  After several days of playing with different parameters for the editor step, I have reached the conclusion that the data is not enough to scaffold the contigs into pseudochromosomes. We started with 390M reads, of which 78M were PCR duplicates, another 78M were from the mitochondrion, and 60M from the chloroplast, leaving us ~25M reads to play with after the different layers of filtering. Juicebox shows the chromosomes in the depletion score, but in round 2 3d-dna still splits the scaffolds. I have attached the latest figures for reference. I ran this with --editor-coarse-resolution 2500000 --editor-coarse-region 12500000 --editor-fine-resolution 100000. Changing --editor-saturation-centile from 2 to 20 did not seem to have any effect. Should I try higher values?
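The read budget above, written out (the drop from the remaining pairs to ~25M reflects the further filtering layers mentioned in the post but not itemized, so that figure is reported rather than computed here):

```python
total    = 390_000_000  # sequenced reads
pcr_dups =  78_000_000
mito     =  78_000_000
chloro   =  60_000_000

remaining = total - pcr_dups - mito - chloro
print(f"~{remaining / 1e6:.0f}M reads left before MAPQ and other filters")
```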

  Best regards
 
  Aureliano
Peame_canu01arrow01.contigs.1.juicebox.pdf
Peame_canu01arrow01.contigs.0.juicebox.pdf

Olga Dudchenko

Aug 30, 2018, 5:24:29 PM
to 3D Genomics
Hi Aureliano,

I believe that your conclusion about lack of data is absolutely correct.

Great work saturating the depletion track despite the data limitation. Is 2500000 the lowest resolution at which you can saturate things? Judging by the tracks, what you want to do now is increase --editor-repeat-coverage; the depletion score parameters you have handle the low data (except that, of course, you will not be able to detect misassemblies at fine scale, but given the lack of data you might not have the power to do that anyway).

Note that if the data only allows you to do misjoin detection at this very large scale, you might just try going directly for polishing, which is in essence exactly that: error correction at a large scale.

Hope this helps,
Olga

zhua...@gmail.com

Oct 18, 2018, 2:03:52 AM
to 3D Genomics
How do you carry out the coverage analysis?

On Saturday, August 25, 2018 at 12:41:25 AM UTC+8, aur...@vt.edu wrote:

Olga Dudchenko

Oct 18, 2018, 6:58:02 AM
to 3D Genomics
Aureliano,

You can load the coverage track (1D annotations -> basic annotations -> coverage) in Juicebox/JBAT. You can also run something like the attached script for a bulk view.
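The attached plot_coverage.awk is not reproduced here; as a stand-in, a bulk coverage view can be sketched by binning the two mapped positions of each pair in merged_nodups.txt. The column indices below follow the standard merged_nodups layout (str1 chr1 pos1 frag1 str2 chr2 pos2 frag2 ...); verify them against your file:

```python
from collections import Counter

BIN = 25_000  # bin size in bp; an arbitrary choice for illustration

def coverage_bins(lines):
    """Count how many pair ends land in each (scaffold, bin) from
    merged_nodups-style lines."""
    cov = Counter()
    for line in lines:
        f = line.split()
        # fields 2,3 and 6,7 (1-based) are chr1, pos1 and chr2, pos2
        for chrom, pos in ((f[1], int(f[2])), (f[5], int(f[6]))):
            cov[(chrom, pos // BIN)] += 1
    return cov

toy = ["0 scaf1 1200 0 16 scaf1 30100 1 60 - - 60 - - r1 r1",
       "0 scaf1 1800 0 16 scaf2 500 0 60 - - 60 - - r2 r2"]
print(sorted(coverage_bins(toy).items()))
```

Plotting the resulting counts per scaffold gives the same kind of bulk view as the awk script; bins with zero counts are the regions the thread discusses as uncovered.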

Best,
Olga
plot_coverage.awk