Dear 3D Genomic Group,
I am trying to scaffold our assembly (genome size ~900 Mb; assembly size 870 Mb in 3,204 sequences, with an L50/N50 of 136 sequences/1.56 Mb). We did Hi-C library prep and sequencing with PhaseGenomics, getting > 60X of paired-end reads, of which > 45X mapped to the genome. The file merged_nodups.txt contains 48M mapped reads. We ran Juicer and 3d-dna with the default parameters, and the output was more fragmented than the input (13,127 sequences, with an L50/N50 of 1,102 sequences/0.14 Mb).
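To be explicit about the contiguity figures quoted above: the value in Mb is the N50 (a length) and the value in sequences is the L50 (a count). A minimal sketch of how both can be derived from a list of sequence lengths follows. The function name `n50_l50` and the toy lengths are illustrative only and are not part of Juicer or 3d-dna.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of sequence lengths.

    N50: length of the sequence at which the running total of the
         length-sorted sequences first reaches half the assembly size.
    L50: number of sequences needed to reach that point.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count


# Toy example (hypothetical lengths, not our assembly)
toy_lengths = [50, 40, 30, 20, 10]
toy_n50, toy_l50 = n50_l50(toy_lengths)
print(f"N50 = {toy_n50}, L50 = {toy_l50}")
```

In practice the lengths would come from the FASTA index of the input assembly and of the scaffolded output, so that both are measured the same way.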
In the first Hi-C heatmap (myassembly.0.hic) I can see as many clusters as the number of chromosomes I am expecting, but in the subsequent rounds these clusters get smaller and smaller, and sequences appear to be moved to a "no cluster" zone.
The Juicer output was as follows:
- Library Complexity Estimate: 52,709,417
- Intra-fragment Reads: 2,597,203 (1.37% / 5.42%)
- Below MAPQ Threshold: 33,028,439 (17.47% / 68.88%)
- Hi-C Contacts: 12,321,814 (6.52% / 25.70%)
- Ligation Motif Present: 3,593,834 (1.90% / 7.50%)
- 3' Bias (Long Range): 64% - 36%
- Pair Type %(L-I-O-R): 25% - 25% - 25% - 25%
- Inter-chromosomal: 638,734 (0.34% / 1.33%)
- Intra-chromosomal: 11,683,080 (6.18% / 24.37%)
- Short Range (<20Kb): 7,293,881 (3.86% / 15.21%)
- Long Range (>20Kb): 4,388,919 (2.32% / 9.15%)
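As a sanity check on the statistics above, the totals implied by each count and its pair of percentages can be recovered with simple arithmetic. The sketch below assumes, as my reading of the Juicer report, that the first percentage is relative to all sequenced read pairs and the second to unique (deduplicated) read pairs. The variable and function names are illustrative.

```python
# (count, % of sequenced reads, % of unique reads), copied from the report above
stats = {
    "Intra-fragment Reads": (2_597_203, 1.37, 5.42),
    "Below MAPQ Threshold": (33_028_439, 17.47, 68.88),
    "Hi-C Contacts": (12_321_814, 6.52, 25.70),
}


def implied_totals(count, pct_sequenced, pct_unique):
    """Back-calculate the two denominators from a count and its percentages."""
    return count / (pct_sequenced / 100), count / (pct_unique / 100)


totals = {name: implied_totals(*values) for name, values in stats.items()}
for name, (sequenced, unique) in totals.items():
    print(f"{name}: ~{sequenced / 1e6:.0f}M sequenced, ~{unique / 1e6:.0f}M unique")

# The three categories should partition the unique reads (sum close to 100%).
unique_pct_sum = sum(values[2] for values in stats.values())
print(f"Sum of unique-read percentages: {unique_pct_sum:.2f}%")
```

Under that assumption, every row implies roughly 189M sequenced pairs and roughly 48M unique pairs, in line with the 48M entries in merged_nodups.txt, and only about a quarter of the unique pairs are counted as Hi-C contacts.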
So we thought that, despite the high number of PCR duplicates, we still have enough reads for the scaffolding. We also checked the coverage: ~22% of the genome is not covered by any mapped read, and 44% of the genome is covered by 5 or more reads.
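For anyone wanting to reproduce the coverage fractions, here is a minimal sketch of one way to compute them. It assumes a three-column, tab-separated per-base depth table (sequence name, position, depth) that includes zero-depth positions, such as the output of `samtools depth -a`. The function name `coverage_fractions` and the toy input are hypothetical.

```python
import io


def coverage_fractions(lines, min_depth=5):
    """Return (fraction of bases with depth 0, fraction with depth >= min_depth).

    `lines` is any iterable of 'name<TAB>position<TAB>depth' strings.
    """
    total = uncovered = well_covered = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        _name, _pos, depth_text = line.split("\t")
        depth = int(depth_text)
        total += 1
        if depth == 0:
            uncovered += 1
        if depth >= min_depth:
            well_covered += 1
    if total == 0:
        raise ValueError("no positions found in input")
    return uncovered / total, well_covered / total


# Toy input with four positions (depths 0, 3, 5, 9); not real data
toy_depths = io.StringIO("ctg1\t1\t0\nctg1\t2\t3\nctg1\t3\t5\nctg1\t4\t9\n")
zero_frac, deep_frac = coverage_fractions(toy_depths)
print(f"uncovered: {zero_frac:.0%}, depth >= 5: {deep_frac:.0%}")
```

For a real genome the depth table is large, so the function reads line by line and can be fed an open file handle instead of the in-memory toy input.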
Reading the manual, I got the impression that a lot of contacts have been annotated as debris, but I do not know:
- Which files should I look at?
- Which stats should I look for in those files?
- Which parameters should I change in order to improve the assembly?
Thank you.
Aureliano