Juicer Pre Run Time - 9 days and counting. Do I kill it?


Elizabeth Scholl

Nov 28, 2018, 2:49:41 PM
to 3D Genomics
Hello - 

I've been running juicer pre for 9 days now, and I'm wondering what "normal run time" would be with my set of data.
I've got a de novo PacBio assembly for my reference, which is in 9,205 pieces. The goal was to see if adding in a lane of Hi-C would give us a better assembly.

I'm running on a GPU, using 16 cores, and the protocol I was looking at said to run pre with a -q 1 and then follow-up with a -q 30.
I'm still running the -q 1 side of things.

Because it's been 9 days, I'm a bit reluctant to kill the job without checking in here first ;) 

merged_nodups.txt.gz is 8.0G 

inter.hic is currently at 1.7G and was written to less than a minute ago (as was my SLURM output)

My inter.txt file gives the following stats:

Sequenced Read Pairs:  145,477,454
 Normal Paired: 128,307,544 (88.20%)
 Chimeric Paired: 3,926,707 (2.70%)
 Chimeric Ambiguous: 9,924,800 (6.82%)
 Unmapped: 3,318,403 (2.28%)
 Ligation Motif Present: 5,990,072 (4.12%)
Alignable (Normal+Chimeric Paired): 132,234,251 (90.90%)
Unique Reads: 119,080,348 (81.85%)
PCR Duplicates: 13,012,416 (8.94%)
Optical Duplicates: 141,487 (0.10%)
Library Complexity Estimate: 625,661,246
Intra-fragment Reads: 9,385,779 (6.45% / 7.88%)
Below MAPQ Threshold: 101,281,196 (69.62% / 85.05%)
Hi-C Contacts: 8,413,373 (5.78% / 7.07%)
 Ligation Motif Present: 684,188  (0.47% / 0.57%)
 3' Bias (Long Range): 54% - 46%
 Pair Type %(L-I-O-R): 25% - 25% - 25% - 25%
Inter-chromosomal: 3,698,383  (2.54% / 3.11%)
Intra-chromosomal: 4,714,990  (3.24% / 3.96%)
Short Range (<20Kb): 4,298,451  (2.95% / 3.61%)
Long Range (>20Kb): 414,058  (0.28% / 0.35%)



So - do I kill the job and restart it, running the -q 30 first (since I'll likely be using that for the next steps)?
Do I follow some of what was recommended to cut down run-time here:
https://groups.google.com/forum/embed/?place=forum/3d-genomics&showsearch=true&showpopout=true&parenturl=http%3A%2F%2Faidenlab.org%2Fforum.html#!searchin/3d-genomics/pre$20run$20time/3d-genomics/dgqNM32cEmQ/P5DjEyETCgAJ

Or does the thought of my killing a job after 9 days make you cringe?

Any advice is very much welcome. 

Thanks!

(P.S. I only have easy access to the one GPU node, so starting the quality-30 cut-off run while leaving the other running isn't realistic right now :( )



Neva Durand

Nov 28, 2018, 2:56:32 PM
to Elizabeth Scholl, 3d-ge...@googlegroups.com
I'm sorry to hear about your problems! And with only 145M reads I don't see any reason why Pre would take so long - on our machines it would take much less than a day, perhaps under an hour without fragment resolutions. I would certainly follow the steps Muhammad recommended in that post. You might also check with your IT people that you're getting appropriate priority (i.e. that there aren't other processes competing for your run time) and that there's not a disk issue.

I would kill and run Juicer Tools again with more RAM (though I would think 8g would be sufficient) and with the "-x" flag.
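[Editor's note: a re-run along these lines might look like the sketch below. Only the -q and -x flags and the 8g heap figure come from this thread; the jar path, file names, and the chrom.sizes argument are placeholder assumptions, so adjust them to your own setup.]

```shell
# Hedged sketch: re-launching Juicer Tools Pre with a modest fixed heap,
# a MAPQ-30 cutoff, and the -x flag mentioned above.
# juicer_tools.jar, the input/output names, and genome.chrom.sizes are
# placeholders, not paths from this thread.
java -Xmx8g -jar juicer_tools.jar pre -q 30 -x \
     merged_nodups.txt inter_30.hic genome.chrom.sizes
```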

Also, if you don't need HiCCUPS (and you don't have enough reads for HiCCUPS), then you don't need GPUs.

One final note - if your Hi-C data is PacBio, you might consider changing the internal aligner in Juicer from bwa to minimap2.

Best
Neva 

--
You received this message because you are subscribed to the Google Groups "3D Genomics" group.
To unsubscribe from this group and stop receiving emails from it, send an email to 3d-genomics...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/3d-genomics/69fd8461-7ca7-44d8-8e0a-39be587f48c1%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
Neva Cherniavsky Durand, Ph.D.
Staff Scientist, Aiden Lab

Elizabeth Scholl

Nov 28, 2018, 3:23:43 PM
to Neva Durand, 3d-ge...@googlegroups.com
Thank you, Neva - Wish I had written to you on Monday ;) 

I'm the only user on the GPU and I sent the command through with -Xms49152m -Xmx49152m 
so yeah, I'll be killing it now.

The PacBio is my "DNASeq" data that created the draft assembly. The HiC data is Illumina. Sorry for that confusion.

My end-goal here is to be able to get the hic file to import to Juicebox and see if I can't order and orient the scaffolds I have to get a better assembly.
Do you think I have hope in that regard?

Thanks again! Always impressed at the speed of reply from you guys!

Betsy


--
Dr. Elizabeth H. Scholl (Betsy)
Staff Director and Research Scholar
Bioinformatics Consulting and Service Core
Statistical Consulting Core
357B Ricks Hall, NCSU Campus
Raleigh, NC 27695-7614
(919) 515-7655

(Email is the best way to get in touch with me)

Olga Dudchenko

Nov 28, 2018, 3:49:08 PM
to 3D Genomics
Elizabeth,

You should not be running pre on the draft at all. Please see how to visualize draft genome assemblies on page 5 of the genome assembly cookbook: http://aidenlab.org/assembly/manual_180322.pdf

The result will be a map of the draft compatible with Juicebox Assembly Tools, if you want to do manual ordering and orientation. To order and orient scaffolds using 3d-dna you only need the merged_nodups.txt file, which is created earlier in the Juicer workflow. 3d-dna will make hic files as part of the workflow.

Best,
Olga

Elizabeth Scholl

Nov 28, 2018, 4:18:46 PM
to odudc...@icloud.com, 3d-ge...@googlegroups.com
Hi Olga - 

I was JUST reading that! Thank you!




Zhiguang Huo

Dec 3, 2018, 5:07:15 PM
to 3D Genomics
Hi Neva Durand:

I ran into the same problem Elizabeth Scholl mentioned:
the Pre step takes forever.

The reason is that "awk -f ${juiceDir}/scripts/common/dups.awk -v name=${outputdir}/ ${outputdir}/merged_sort.txt" takes forever when processing chr2.
I worked around it by replacing that line with

awk '!x[$1$2$3$4$5$6$7$8]++' ${outputdir}/merged_sort.txt > ${outputdir}/merged_nodups.txt

which removes the duplicates very quickly.
Hope this is correct.

Neva Durand

Dec 3, 2018, 5:14:16 PM
to Zhiguang Huo, 3D Genomics
This is a different problem. Elizabeth specifically said that Pre was taking a long time. 

Your duplicate code will only remove exact duplicates; we remove exact and near duplicates. Please see the supplemental materials of the Cell 2014 paper to learn more, or search this forum for past threads about “wobble”.

By the way, that awk script adds an entry to the x hash table for every line, so with a large merged_sort file it will use roughly as much memory as the merged_sort file itself.
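[Editor's note: the distinction between exact and near duplicates can be illustrated with a small sketch. The idea is that read pairs whose mapped positions differ by only a few base pairs are still treated as duplicates. This is an illustration in the spirit of Juicer's dups.awk, not its actual code; in particular the 4 bp wobble window below is an assumption chosen for the example, not necessarily Juicer's exact tolerance.]

```python
# Hedged sketch of near-duplicate ("wobble") removal for Hi-C read pairs.
# The WOBBLE window of 4 bp is an illustrative assumption, not Juicer's
# exact parameter; field layout is simplified to (chr1, pos1, chr2, pos2).

WOBBLE = 4

def dedup_with_wobble(pairs):
    """pairs: list of (chr1, pos1, chr2, pos2), assumed sorted by
    (chr1, chr2, pos1, pos2). A pair is dropped if an already-kept pair
    on the same chromosome pair lies within WOBBLE bp on both ends."""
    kept = []
    recent = []  # recently kept pairs still close enough on pos1 to match
    for chr1, pos1, chr2, pos2 in pairs:
        # discard candidates that can no longer be within WOBBLE of pos1
        recent = [p for p in recent
                  if p[0] == chr1 and p[2] == chr2 and pos1 - p[1] <= WOBBLE]
        is_dup = any(abs(pos2 - p[3]) <= WOBBLE for p in recent)
        if not is_dup:
            kept.append((chr1, pos1, chr2, pos2))
            recent.append((chr1, pos1, chr2, pos2))
    return kept

pairs = [
    ("chr1", 100, "chr2", 500),
    ("chr1", 100, "chr2", 900),   # same pos1, far pos2: kept
    ("chr1", 102, "chr2", 503),   # within 4 bp of the first pair: dropped
    ("chr1", 250, "chr2", 500),   # far pos1: kept
]
print(dedup_with_wobble(pairs))
```

An exact-match dedup (like the awk one-liner above) would keep all four pairs here, because no two lines are byte-identical; the wobble-aware pass correctly drops the third.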



Zhiguang Huo

Dec 3, 2018, 7:41:16 PM
to 3D Genomics
Hi Neva:

Thanks for your quick reply.
You are right: my problem (dedup) happens before the Pre step, so they are different problems.

My awk script only hashes 8 columns of merged_sort, but it still uses a large chunk of memory.
And it doesn't account for wobble.