The final assembly has many scaffolds with low quality

461 views
Skip to first unread message

karim karimi

unread,
Apr 10, 2021, 12:36:35 PM4/10/21
to 3D Genomics

Hi 3D Genomics Group,

We generated a genome assembly using Pacbio HiFi reads. The assembly has 291 contigs and is of high quality with N50 of 30 Mb and BUSCO of 96%. After applying Hi-C data in 3D-DNA (defaults), the number of contigs increased and the N50 was decreased. When I loaded the first round (hifiasm_ccs.p_ctg.0.hic) to Juicebox, the  chromosomes pattern was clear but  finally all big contigs were classified as one chromosome while there are numerous small chromosomes beside them. I reviewed other similar posts here, but I could not figure out what type of parameter adjustment is required for my data. I also performed an analysis using --editor-repeat-coverage of 20, but the results were not promising.

1.      Is there any problem related to coverage? Are these values very low (the values are in range of 0 to 1.5)?

2.      Is it essential to work on --editor-coarse-resolution and/or --editor-saturation-centile parameters? Which measures can guide me to decide about them? 

I would appreciate if you could help me to handle these issues. 

hifiasm_0_3d-dna.hic.0.pdf

karim karimi

unread,
Apr 10, 2021, 12:39:23 PM4/10/21
to 3D Genomics

Here are the stats of our library (inter.txt of juicer):

Sequenced Read Pairs:  22,500,000

 Normal Paired: 8,887,459 (39.50%)

 Chimeric Paired: 10,317,221 (45.85%)

 Chimeric Ambiguous: 2,936,537 (13.05%)

 Unmapped: 358,783 (1.59%)

 Ligation Motif Present: 0 (0.00%)

Alignable (Normal+Chimeric Paired): 19,204,680 (85.35%)

Unique Reads: 17,172,328 (76.32%)

PCR Duplicates: 1,786,376 (7.94%)

Optical Duplicates: 245,976 (1.09%)

Library Complexity Estimate: 94,179,673

Intra-fragment Reads: 0 (0.00% / 0.00%)

Below MAPQ Threshold: 2,683,947 (11.93% / 15.63%)

Hi-C Contacts: 14,488,381 (64.39% / 84.37%)

 Ligation Motif Present: 0  (0.00% / 0.00%)

 3' Bias (Long Range): 0% - 0%

 Pair Type %(L-I-O-R): 25% - 25% - 25% - 25%

Inter-chromosomal: 7,259,222  (32.26% / 42.27%)

Intra-chromosomal: 7,229,159  (32.13% / 42.10%)

Short Range (<20Kb): 3,597,304  (15.99% / 20.95%)

Long Range (>20Kb): 3,630,029  (16.13% / 21.14%) 

Olga Dudchenko

unread,
Apr 23, 2021, 11:57:53 AM4/23/21
to 3D Genomics
Hi Karim,

You just seem to have very little data. 3.6M reads to scaffold a 3Gb genome, that's a bit low for default parameters. I don't see your repeat bed track, but look at the mismatch track: the data is so sparse near the diagonal that everything gets flagged. It's fine with this coverage given that you have relatively large scaffolds, but you'll have to either skip error correction altogether (-r 0) or run it at much lower resolution that the default (25kb).

Best,
Olga

karim karimi

unread,
Apr 25, 2021, 11:59:31 PM4/25/21
to 3D Genomics

Hi Olga,

Thank you for your response. As you suggested, I followed two scenarios: 1) I used -r 0 to run the pipeline, but the results were not satisfying and there were not suitable patterns. 2) I changed the resolution parameter to --editor-coarse-resolution 1000,000 --editor-coarse-region 5,000,000 and --splitter-coarse-resolution 1000,000. In this case, the final chromosome pattern seems reasonable and there were almost 15 scaffolds (chromosomes). However, there are still many small scaffolds in the left and down side of figure that could not be assigned to a certain chromosome. The linear smears in the front of each chromosome indicate that these scaffolds belong to corresponding chromosomes. It would be possible to correct some of them using Juicebox but the number of them is high and it is not possible to handle this accurately. I am a bit confused to choose a right decision:

1)    I could ignore and discard all these small scaffolds? If so, we will lose some parts of main assembly

2)    It is possible to handle this in JuiceBox and I should take time on these? Many scaffolds are so small without any strong signal.

3)    We require to order and add another library of Hi-C data? How many paired reads would require as a regular value for a genome size of 2.68 Gb.

4)    There are other parameter settings that I could try them to correct final map?

Another point, when I use the assembly form juicebox to run-asm-pipeline-post-review.sh, the speed of generating .fasta file is quite slow:

Bash run-asm-pipeline-post-review.sh --sort-output -s seal -i 500 -r  hifiasm_ccs.p_ctg.0.review.assembly hifiasm_ccs.p_ctg.fasta  merged_nodups.txt

How can I achieve the final fasta file.

Best regards,

Karim


hifiasm_ccs.p_ctg.rawchro.pdf
hifiasm_chr_3dna_res_2.5m_sat50.PNG

Olga Dudchenko

unread,
Apr 29, 2021, 6:08:22 PM4/29/21
to 3D Genomics
1) the assembly fasta as generated contains all sequences, not just the anchored. So you will not loose sequence, but it will be unanchored. Since you clearly see you can anchor them seems like not the best solution to me.

2) Yes, of course you can handle things in Juicebox Assembly Tools. You want to anchor large things that you are confindent about. The goal is not to anchor absolutely everything. There will almost always be some unanchored sequences. In fact, by default 3d-dna will not try to anchor sequences smaller than 15kb for that precise reason, that at expected coverage the signal is sparse enough for small contigs that you start making mistakes when ordering and orienting etc.

3) The coverage requirements depend on your draft N50, if the assembly is all you care about. DNA Zoo as a standard sequences to about 30X contacts because we also want to look at 3D signal which is hard to do with data as sparse as yours. Note that given how sparse your data is it is unreasonable for you to pass -i 500. Those contigs wont' have any reads as a rule with your coverage.

4) I cannot recommend anything regarding the setting since I do not know the reason why those sequences were removed in the first place. But there are tracks generated by 3d-dna that will tell you just that. I encourage you to read either the cookbook or the supplement to Dudchenko et al., 2017 which describes in detail how something gets annotated and removed, and look at the tracks associated with each individual step (those bed files that tell you what gets flagged as misjoin, what gets flagged as repeat etc.).

If your spead is slow check that your fasta is unwrapped (each sequence is represented with one huge line). If not wrapped run utils/wrap-fasta.awk (this will create a fasta that's the same but split across multiple lines which is better for compute) and pass that.

Best,
Olga  
Reply all
Reply to author
Forward
0 new messages