I recently attempted to run the Juicer pipeline on an Arima-kit-based Hi-C dataset. Because Arima uses multiple restriction enzymes, I decided to run Juicer in DNase mode using the "-s none" flag, since this let me at least test the pipeline without having to implement a new "generate_site_positions.py" script. I don't have any experience with Juicer, so I can't really evaluate how reasonable the output is, but it seems quite lackluster. I've posted the stats for inter_30 below. I also got output from Arrowhead and HiCCUPS.
...
Sequenced Read Pairs: 79,356,788
Normal Paired: 27,920,899 (35.18%)
Chimeric Paired: 28,615,641 (36.06%)
Chimeric Ambiguous: 20,950,219 (26.40%)
Unmapped: 1,870,029 (2.36%)
Ligation Motif Present: 0 (0.00%)
Alignable (Normal+Chimeric Paired): 56,536,540 (71.24%)
Unique Reads: 53,079,577 (66.89%)
PCR Duplicates: 3,424,642 (4.32%)
Optical Duplicates: 32,321 (0.04%)
Library Complexity Estimate: 447,108,814
Intra-fragment Reads: 0 (0.00% / 0.00%)
Below MAPQ Threshold: 30,344,444 (38.24% / 57.17%)
Hi-C Contacts: 22,735,133 (28.65% / 42.83%)
Ligation Motif Present: 0 (0.00% / 0.00%)
3' Bias (Long Range): 0% - 0%
Pair Type %(L-I-O-R): 25% - 25% - 25% - 25%
Inter-chromosomal: 13,877,268 (17.49% / 26.14%)
Intra-chromosomal: 8,857,865 (11.16% / 16.69%)
Short Range (<20Kb): 6,812,548 (8.58% / 12.83%)
Long Range (>20Kb): 2,038,770 (2.57% / 3.84%)
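
For anyone reading the stats: the two percentages on each line appear to use different denominators, the first being sequenced read pairs and the second unique reads. Here is a minimal Python sketch that recomputes them from the counts above. The variable names and the `pct` helper are my own and not part of Juicer.

```python
# Counts copied from the inter_30 stats above.
sequenced = 79_356_788
unique = 53_079_577
below_mapq = 30_344_444
hic_contacts = 22_735_133
inter = 13_877_268
intra = 8_857_865


def pct(count, total):
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100.0 * count / total, 2)


# First figure: share of sequenced pairs; second figure: share of unique reads.
print("Hi-C contacts:", pct(hic_contacts, sequenced), pct(hic_contacts, unique))
print("Below MAPQ:", pct(below_mapq, sequenced), pct(below_mapq, unique))

# Consistency checks: unique reads split into below-MAPQ reads and Hi-C
# contacts, and contacts split into inter- and intra-chromosomal.
print(below_mapq + hic_contacts == unique)
print(inter + intra == hic_contacts)
```

Both checks hold for these numbers, so every unique read that did not become a Hi-C contact was dropped at the MAPQ filter.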
...I can think of multiple possible reasons why my output was so lackluster, and I'm looking for advice on which to pursue first, or on other directions to look in:
1. I'm running without restriction site positions, which may be degrading significance in some sense.
2. I'm running on a draft assembly. Given how HiCCUPS and Arrowhead operate, they may perform better on scaffolded assemblies. Perhaps I should use 3D-DNA to generate such an assembly and then run the Juicer pipeline on the result?
3. I have about half the typical number of meaningful reads. The Aiden Lab assembly cookbook recommends 100 million reads. I only have 79 million, of which only 53 million (the unique reads) are apparently meaningful. Perhaps my dataset is simply too small to reach significance?
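
Regarding point 1, here is a rough Python sketch of what locating sites for several enzymes at once might look like. The motifs `GATC` and `GANTC` are an assumption on my part and should be checked against the kit documentation. The function names are hypothetical. The sketch only reports 1-based motif start positions. It does not handle cut offsets or the file format Juicer expects.

```python
import re

# Assumed recognition motifs; N matches any base. These are illustrative and
# should be checked against the kit documentation.
MOTIFS = ["GATC", "GANTC"]


def motif_to_regex(motif):
    """Translate a motif with 'N' wildcards into a regex fragment."""
    return "".join("[ACGT]" if base == "N" else base for base in motif.upper())


def find_sites(sequence, motifs=MOTIFS):
    """Return sorted 1-based start positions where any motif occurs."""
    alternatives = "|".join(motif_to_regex(m) for m in motifs)
    # A lookahead lets overlapping occurrences be reported as well.
    pattern = re.compile("(?=(" + alternatives + "))")
    return sorted({m.start() + 1 for m in pattern.finditer(sequence.upper())})


# Toy sequence with one GATC (position 3) and one GAGTC (position 9).
print(find_sites("AAGATCTTGAGTCAA"))
```

Running something like this per contig would give the merged site list that a single-enzyme script cannot produce.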
...It may also be worth noting that I'm running the CPU version of all Juicer tools.