Hi 3D Genomics Group,
We generated a genome assembly using Pacbio HiFi reads. The assembly has 291 contigs and is of high quality with N50 of 30 Mb and BUSCO of 96%. After applying Hi-C data in 3D-DNA (defaults), the number of contigs increased and the N50 was decreased. When I loaded the first round (hifiasm_ccs.p_ctg.0.hic) to Juicebox, the chromosomes pattern was clear but finally all big contigs were classified as one chromosome while there are numerous small chromosomes beside them. I reviewed other similar posts here, but I could not figure out what type of parameter adjustment is required for my data. I also performed an analysis using --editor-repeat-coverage of 20, but the results were not promising.
1. Is there any problem related to coverage? Are these values very low (the values are in range of 0 to 1.5)?
2. Is it essential to work on --editor-coarse-resolution and/or --editor-saturation-centile parameters? Which measures can guide me to decide about them?
I would appreciate if you could help me to handle these issues.
Here are the stats of our library (inter.txt of juicer):
Sequenced Read Pairs: 22,500,000
Normal Paired: 8,887,459 (39.50%)
Chimeric Paired: 10,317,221 (45.85%)
Chimeric Ambiguous: 2,936,537 (13.05%)
Unmapped: 358,783 (1.59%)
Ligation Motif Present: 0 (0.00%)
Alignable (Normal+Chimeric Paired): 19,204,680 (85.35%)
Unique Reads: 17,172,328 (76.32%)
PCR Duplicates: 1,786,376 (7.94%)
Optical Duplicates: 245,976 (1.09%)
Library Complexity Estimate: 94,179,673
Intra-fragment Reads: 0 (0.00% / 0.00%)
Below MAPQ Threshold: 2,683,947 (11.93% / 15.63%)
Hi-C Contacts: 14,488,381 (64.39% / 84.37%)
Ligation Motif Present: 0 (0.00% / 0.00%)
3' Bias (Long Range): 0% - 0%
Pair Type %(L-I-O-R): 25% - 25% - 25% - 25%
Inter-chromosomal: 7,259,222 (32.26% / 42.27%)
Intra-chromosomal: 7,229,159 (32.13% / 42.10%)
Short Range (<20Kb): 3,597,304 (15.99% / 20.95%)
Long Range (>20Kb): 3,630,029 (16.13% / 21.14%)Hi Olga,
Thank you for your response. As you suggested, I followed two scenarios: 1) I used -r 0 to run the pipeline, but the results were not satisfying and there were not suitable patterns. 2) I changed the resolution parameter to --editor-coarse-resolution 1000,000 --editor-coarse-region 5,000,000 and --splitter-coarse-resolution 1000,000. In this case, the final chromosome pattern seems reasonable and there were almost 15 scaffolds (chromosomes). However, there are still many small scaffolds in the left and down side of figure that could not be assigned to a certain chromosome. The linear smears in the front of each chromosome indicate that these scaffolds belong to corresponding chromosomes. It would be possible to correct some of them using Juicebox but the number of them is high and it is not possible to handle this accurately. I am a bit confused to choose a right decision:
1) I could ignore and discard all these small scaffolds? If so, we will lose some parts of main assembly
2) It is possible to handle this in JuiceBox and I should take time on these? Many scaffolds are so small without any strong signal.
3) We require to order and add another library of Hi-C data? How many paired reads would require as a regular value for a genome size of 2.68 Gb.
4) There are other parameter settings that I could try them to correct final map?
Another point, when I use the assembly form juicebox to run-asm-pipeline-post-review.sh, the speed of generating .fasta file is quite slow:
Bash run-asm-pipeline-post-review.sh --sort-output -s seal -i 500 -r hifiasm_ccs.p_ctg.0.review.assembly hifiasm_ccs.p_ctg.fasta merged_nodups.txt
How can I achieve the final fasta file.
Best regards,
Karim