Arima based Hi-C results


Nolan Hartwick

May 29, 2019, 4:25:48 PM
to 3D Genomics
I recently attempted to run the Juicer pipeline on an Arima-kit-based Hi-C dataset. Because Arima uses multiple restriction enzymes, I decided to run Juicer in DNase mode using the "-s none" flag, as this let me at least test the pipeline without having to implement a new "generate_site_positions.py" script. I don't have any experience with Juicer, so I can't really evaluate how reasonable the output is, but it seems quite lackluster. I've posted the stats for inter_30 below. I also got output from Arrowhead and HiCCUPS: Arrowhead identified only 3 sites and HiCCUPS identified zero...


Sequenced Read Pairs:  79,356,788
 Normal Paired: 27,920,899 (35.18%)
 Chimeric Paired: 28,615,641 (36.06%)
 Chimeric Ambiguous: 20,950,219 (26.40%)
 Unmapped: 1,870,029 (2.36%)
 Ligation Motif Present: 0 (0.00%)
Alignable (Normal+Chimeric Paired): 56,536,540 (71.24%)
Unique Reads: 53,079,577 (66.89%)
PCR Duplicates: 3,424,642 (4.32%)
Optical Duplicates: 32,321 (0.04%)
Library Complexity Estimate: 447,108,814
Intra-fragment Reads: 0 (0.00% / 0.00%)
Below MAPQ Threshold: 30,344,444 (38.24% / 57.17%)
Hi-C Contacts: 22,735,133 (28.65% / 42.83%)
 Ligation Motif Present: 0  (0.00% / 0.00%)
 3' Bias (Long Range): 0% - 0%
 Pair Type %(L-I-O-R): 25% - 25% - 25% - 25%
Inter-chromosomal: 13,877,268  (17.49% / 26.14%)
Intra-chromosomal: 8,857,865  (11.16% / 16.69%)
Short Range (<20Kb): 6,812,548  (8.58% / 12.83%)
Long Range (>20Kb): 2,038,770  (2.57% / 3.84%)
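
As a sanity check on how to read the two percentages on each line, the counts above can be re-derived by hand. The sketch below is only arithmetic on the numbers already posted; the variable names and the `pct` helper are made up for illustration, and the interpretation (first percentage relative to sequenced read pairs, second relative to unique reads) is inferred from the numbers, not taken from Juicer documentation.

```python
# Counts copied from the inter_30 statistics above.
sequenced = 79_356_788
alignable = 56_536_540
unique = 53_079_577
pcr_dups = 3_424_642
optical_dups = 32_321
below_mapq = 30_344_444
hic_contacts = 22_735_133
inter = 13_877_268
intra = 8_857_865


def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Alignable pairs split into unique reads plus the two duplicate classes.
print(unique + pcr_dups + optical_dups == alignable)

# Unique reads split into those below the MAPQ threshold and Hi-C contacts
# (intra-fragment reads are 0 in this run).
print(below_mapq + hic_contacts == unique)

# Hi-C contacts split into inter- and intra-chromosomal.
print(inter + intra == hic_contacts)

# The two reported percentages for "Hi-C Contacts" (28.65% / 42.83%).
print(pct(hic_contacts, sequenced), pct(hic_contacts, unique))
```

Read this way, the 42.83% figure says that a bit under half of the unique reads survive the MAPQ 30 filter, which is the number that matters for the inter_30 map.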


...I can think of several possible reasons why my output was so lackluster, and I'm looking for advice on which to pursue first, or on other directions to look in:

1. I'm running without restriction site positions which may be degrading significance in some sense.

2. I'm running on a draft assembly. Given how HiCCUPS and Arrowhead operate, they may perform better on scaffolded assemblies. Perhaps I should use 3D-DNA to generate such an assembly and then run the Juicer pipeline on the resulting assembly?

3. I have about half the typical number of meaningful reads. The Aiden Lab assembly cookbook recommends 100 million reads. I only have 79 million, of which only 53 million are apparently meaningful. Perhaps my dataset is simply too small to reach significance?

...It may also be worth noting that I'm running the CPU version of all Juicer tools.
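
On point 1, a multi-enzyme site-position file is mostly a regular-expression problem. The following is a minimal sketch, not the shipped "generate_site_positions.py": the motifs (GATC and GANTC, which are commonly cited for the Arima kit), the function names, and the output layout (sequence name, 1-based match starts, then sequence length) are all assumptions that should be checked against the kit documentation and the Juicer script before use.

```python
import re

# ASSUMPTION: recognition motifs commonly cited for the Arima kit.
# Verify against the kit documentation before relying on them.
MOTIFS = ["GATC", "GA[ACGT]TC"]


def site_positions(sequence, motifs=MOTIFS):
    """Return sorted 1-based start positions of every motif match."""
    # A lookahead lets finditer report matches even if they were to overlap.
    pattern = re.compile("(?=(" + "|".join(motifs) + "))")
    return [m.start() + 1 for m in pattern.finditer(sequence.upper())]


def format_line(name, sequence, motifs=MOTIFS):
    """Build one output line: name, positions, then sequence length.

    ASSUMPTION: this layout mirrors what Juicer expects; compare with the
    output of the shipped script for a single-enzyme genome to confirm.
    """
    fields = [name]
    fields += [str(p) for p in site_positions(sequence, motifs)]
    fields.append(str(len(sequence)))
    return " ".join(fields)


if __name__ == "__main__":
    toy = "AAGATCTTGAATCGG"  # hypothetical toy sequence
    print(format_line("chr1", toy))
```

For a real assembly, the same two functions would be applied per FASTA record, streaming one sequence at a time to keep memory use bounded.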

Olga Dudchenko

May 30, 2019, 8:22:34 AM
to 3D Genomics
Nolan,

What is your concern specifically? Your statistics do not strike me as particularly unusual or unexpected. See paragraph #3 in the cookbook for a discussion of the stats (p. 15). Re Arrowhead or HiCCUPS, it is pretty meaningless to try to run them on a draft genome assembly. Re number of reads, please note that 100M is an example recommendation to assemble a mammal end-to-end in a $1k model. Is your sample a mammal (I suspect not, given your MAPQ)? Is your genome larger or smaller? What is the state of your draft? (Drafts that are further along will require less coverage to assemble than highly fragmented ones.) Note also that these are recommendations for assembly and not for feature annotation. If you are interested in running Arrowhead or HiCCUPS, you will need much more coverage (again, not to mention a proper chromosome-length genome).

Best,
Olga