hic problems running Juicer on SGE

Remi Coux

Sep 21, 2018, 8:49:10 AM

to 3D Genomics

Hi Juicer gang, first thanks a lot for developing and maintaining Juicer.

I'm trying to run Juicer on an SGE cluster, so I tweaked the script a little (see attached). However, I'm facing a problem that resembles the previously reported stats bug for low-complexity data, though I believe it starts during generation of the hic file:

The run doesn't complete: two jobs remain queued and are not listed in jobs.out. I get a very small inter_30.hic file (16M), while the dups, merged_nodups and merged_sorted files are 10, 17 and 28G respectively.

I get the following hic30***.err:

Picked up _JAVA_OPTIONS: -Xmx32g

Error while reading graphs file: java.io.FileNotFoundException: /ifs/data/lehmannlab/couxr01/aligned/inter_30_hists.m (No such file or directory)

java.lang.NumberFormatException: For input string: "chrUn_DS484226v1"

at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)

at java.lang.Integer.parseInt(Integer.java:580)

at java.lang.Integer.parseInt(Integer.java:615)

at juicebox.tools.utils.original.AsciiPairIterator.advance(AsciiPairIterator.java:203)

at juicebox.tools.utils.original.AsciiPairIterator.next(AsciiPairIterator.java:247)

at juicebox.tools.utils.original.Preprocessor.computeWholeGenomeMatrix(Preprocessor.java:498)

at juicebox.tools.utils.original.Preprocessor.writeBody(Preprocessor.java:376)

at juicebox.tools.utils.original.Preprocessor.preprocess(Preprocessor.java:286)

at juicebox.tools.clt.old.PreProcessing.run(PreProcessing.java:105)

at juicebox.tools.HiCTools.main(HiCTools.java:98)

Here's the stats30**.err:

id: cannot find name for user ID 1877

id: cannot find name for group ID 1877

id: cannot find name for user ID 1877

/cm/local/apps/environment-modules/3.2.10/Modules/3.2.10/bin/modulecmd: error while loading shared libraries: libc.so.6: failed to map segment from shared object: Cannot allocate memory

/cm/local/apps/sge/var/spool/node026/job_scripts/5671321: line 1: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)

My data shouldn't be that low complexity: my fastq files contain ~500M reads per sample, 95% of which are >=MAPQ30 (per Illumina flowcell stats and the HiCUP report). However, the inter.txt file only gives ~143M read pairs, and none of the align reports show errors. None of the other report files contain errors either; the chimeric and count_ligations files are all empty (I don't know if they should be).

Experiment description: Juicer version 1.5.6; BWA 0.7.7-r441; 1 threads; splitsize 45000000; openjdk version "1.8.0_144"; Juicer Tools Version 1.7.6; /ifs/data/lehmannlab/couxr01/juicer/scripts/juicer.sh -g dm6 -s DpnII -z /ifs/data/lehmannlab/couxr01/juicer/reference/dm6.fa -p /ifs/data/lehmannlab/couxr01/juicer/restriction_sites/dm6.chrom.sizes -y /ifs/data/lehmannlab/couxr01/juicer/restriction_sites/dm6_DpnII.txt -D /ifs/data/lehmannlab/couxr01/juicer

Sequenced Read Pairs: 143,139,799

Normal Paired: 111,560,517 (77.94%)

Chimeric Paired: 36,111 (0.03%)

Chimeric Ambiguous: 12,364,736 (8.64%)

Unmapped: 19,178,408 (13.40%)

Ligation Motif Present: 147,987,325 (103.39%)

Alignable (Normal+Chimeric Paired): 111,596,628 (77.96%)

Unique Reads: 68,465,944 (47.83%)

PCR Duplicates: 41,651,617 (29.10%)

Optical Duplicates: 282,534 (0.20%)

Library Complexity Estimate: 105,911,879

Intra-fragment Reads: 18,678,354 (13.05% / 27.28%)

Below MAPQ Threshold: 26,037,549 (18.19% / 38.03%)

Hi-C Contacts: 23,750,041 (16.59% / 34.69%)

Ligation Motif Present: 11,216,925 (7.84% / 16.38%)

3' Bias (Long Range): 76% - 24%

Pair Type %(L-I-O-R): 25% - 25% - 25% - 25%

Inter-chromosomal: 2,812,642 (1.96% / 4.11%)

Intra-chromosomal: 20,937,399 (14.63% / 30.58%)

Short Range (<20Kb): 16,860,143 (11.78% / 24.63%)

Long Range (>20Kb): 4,077,037 (2.85% / 5.95%)

Could you please let me know if this is the stats bug? Or if you have any idea on what could be causing this - I don't really know where to look for more detailed reports.

Thanks in advance

Remi

juicer.sh

Capture d’écran 2018-09-21 à 14.37.23.png

Muhammad Saad Shamim

Sep 21, 2018, 11:10:03 AM

to rx....@gmail.com, 3D Genomics

Hey Remi,

Are you running juicer.sh on all the samples in one fastq folder?

Each Hi-C library should be processed as separate runs, and later combined into megamaps as appropriate.

Is the data above for just one library or the aggregation of all of them?

Some of this also looks like an error on the cluster, which would require contacting your cluster administrator (looks like permissions for your account?).

id: cannot find name for user ID 1877

id: cannot find name for group ID 1877

id: cannot find name for user ID 1877

https://unix.stackexchange.com/questions/54953/error-message-id-cannot-find-name-for-group-id-after-logging-in

That may be why the stats jobs don't reflect an accurate count.

Also, can you share your changes to juicer.sh as a fork of github.com/theaidenlab/juicer on GitHub so that we can track/compare changes?

Best,

- Muhammad Saad Shamim

Remi Coux

Sep 21, 2018, 11:40:54 AM

to 3D Genomics

Thanks for the prompt reply. I'm contacting my cluster's admins and will update if they manage to fix the problem.

I'm running samples one by one and plan on merging them using the mega script afterwards, thanks.

I created the fork as requested; you can find it here: https://github.com/RXCoux/juicer/blob/master/UGER/scripts/juicer_SGE.sh

Have a great day

Remi

Remi Coux

Sep 27, 2018, 3:51:33 PM

to 3D Genomics

Hi, the cluster admins replied that it was likely due to a RAM shortage on one of the nodes. I increased the memory requested for the hic and stats jobs, and Juicer ran without a single error. However, I still get a small hic file (170 MB) and few domains called (200 vs ~500 published). The number of mapped read pairs in the inter.txt file is roughly 140M vs 190M given by HiCUP. Interestingly, two jobs remain queued but are not listed in debug/jobs*.out.

These experiments were done on Drosophila, which has a very compact genome. I know that settings need to be modified to run other Hi-C packages, but I don't really see what I could tune here.

I updated the script at https://github.com/RXCoux/juicer/blob/master/UGER/scripts/juicer_SGE.sh

Would you guys have any suggestions, ideas or comments, please?

Thanks in advance

Remi

Muhammad Saad Shamim

Sep 27, 2018, 4:38:47 PM

to rx....@gmail.com, 3D Genomics

Hey Remi,

Which jobs remain queued?

Is it 140M read pairs or Hi-C contacts?

Which map are you referring to with the 500 domains / how many reads did they have?

Best,

- Muhammad Saad Shamim

Remi Coux

Sep 27, 2018, 5:20:32 PM

to Muhammad Saad Shamim, 3D Genomics

Hi Muhammad, thanks for the prompt reply,

The two jobs that remain queued (I've left them for > 48h in the past and they never completed) are a1538058429_hic (I get an inter_30.hic file but not inter.hic) and a1538058429_prep_finalize (see attached). Weirdly, they do not appear in the jobs.out list.

I also attach the inter.txt stats, which show Sequenced Read Pairs: 140,742,562. Are they valid pairs? In comparison, HiCUP (report attached as well) gives 186,546,648 total read pairs but 147,622,223 valid pairs.

Regarding the domains, papers such as PMC5389536 found ~500 TADs (and many more A/B compartments) but with ~1B reads and 600k contacts.

I will combine my replicates (I have 2 per condition, roughly 300M sequenced pairs) and will report back.

What could explain such a small hic file?

--

RX Coux

5685788.txt

WT_DpnII_rep2_R1_2.HiCUP_summary_report.html

inter.txt

5685792.txt

Remi Coux

Oct 4, 2018, 5:35:35 PM

to Muhammad Saad Shamim, 3D Genomics

Hi, I had a min_vmem option set super high for the hic job, which left it queued, my bad. I changed it and the script ran completely. I'm now getting 400-500M hic files, and ~450-500 domains are called, which is reasonable.

However, I played with the split size and obtained very different results:

Experiment description: Juicer version 1.5.6; BWA 0.7.7-r441; 1 threads; splitsize 45000000; openjdk version "1.8.0_144"; Juicer Tools Version 1.7.6; /ifs/data/lehmannlab/couxr01/juicer/scripts/juicer.sh -g dm6 -s DpnII -z /ifs/data/lehmannlab/couxr01/juicer/references/dm6.fa -p /ifs/data/lehmannlab/couxr01/juicer/restriction_sites/dm6.chrom.sizes -y /ifs/data/lehmannlab/couxr01/juicer/restriction_sites/dm6_DpnII.txt

Sequenced Read Pairs: 325,032,995

Normal Paired: 250,489,698 (77.07%)

Chimeric Paired: 24,535 (0.01%)

Chimeric Ambiguous: 17,007,191 (5.23%)

Unmapped: 57,511,545 (17.69%)

Ligation Motif Present: 270,587,500 (83.25%)

Alignable (Normal+Chimeric Paired): 250,514,233 (77.07%)

Unique Reads: 142,892,308 (43.96%)

PCR Duplicates: 106,909,089 (32.89%)

Optical Duplicates: 712,836 (0.22%)

Library Complexity Estimate: 200,706,330

Intra-fragment Reads: 6,623,902 (2.04% / 4.64%)

Below MAPQ Threshold: 77,068,008 (23.71% / 53.93%)

Hi-C Contacts: 59,200,398 (18.21% / 41.43%)

Ligation Motif Present: 24,759,348 (7.62% / 17.33%)

3' Bias (Long Range): 82% - 18%

Pair Type %(L-I-O-R): 25% - 25% - 25% - 25%

Inter-chromosomal: 14,519,672 (4.47% / 10.16%)

Intra-chromosomal: 44,680,726 (13.75% / 31.27%)

Short Range (<20Kb): 20,582,814 (6.33% / 14.40%)

Long Range (>20Kb): 24,097,713 (7.41% / 16.86%)

543 domains called; finalcheck.out has the following error message:

***! Error! The statistics do not add up. Alignment likely failed to complete on one or more files. Run relaunch_prep.sh

Stats don't add up. Check /ifs/data/lehmannlab/couxr01/aligned for results (no other error messages)
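As a rough sanity check on the report above, one can parse the stats text and compare the four alignment categories against Sequenced Read Pairs. This is only an illustrative sketch: it is not the check that the finalcheck job performs (that logic is not shown in this thread), it assumes the inter.txt layout pasted above, and `parse_counts` and `alignment_gap` are hypothetical helper names.

```python
import re

# Stats lines copied from the splitsize 45000000 report above.
STATS = """\
Sequenced Read Pairs: 325,032,995
Normal Paired: 250,489,698 (77.07%)
Chimeric Paired: 24,535 (0.01%)
Chimeric Ambiguous: 17,007,191 (5.23%)
Unmapped: 57,511,545 (17.69%)
"""

def parse_counts(text):
    """Map each 'Label: 1,234,567 ...' line to an integer count."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"\s*([^:]+):\s*(\d[\d,]*)", line)
        if match:
            label = match.group(1).strip()
            value = int(match.group(2).replace(",", ""))
            # Some labels repeat in the report; keep the first occurrence.
            counts.setdefault(label, value)
    return counts

def alignment_gap(counts):
    """Sequenced pairs minus the sum of the four alignment categories."""
    categories = ("Normal Paired", "Chimeric Paired",
                  "Chimeric Ambiguous", "Unmapped")
    return counts["Sequenced Read Pairs"] - sum(counts[c] for c in categories)

if __name__ == "__main__":
    print(alignment_gap(parse_counts(STATS)))
```

For the numbers posted above the gap is 26 read pairs; a gap of zero would mean the four categories account for every sequenced pair. Whether this small gap is what triggers the finalcheck message is not established in the thread.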

If I change the split size to 22500000 (same fastq files):

Experiment description: Juicer version 1.5.6; BWA 0.7.7-r441; 1 threads; splitsize 22500000; openjdk version "1.8.0_144"; Juicer Tools Version 1.7.6; /ifs/data/lehmannlab/couxr01/2/juicer/scripts/juicer.sh -g dm6 -s DpnII -z /ifs/data/lehmannlab/couxr01/2/juicer/references/dm6.fa -p /ifs/data/lehmannlab/couxr01/2/juicer/restriction_sites/dm6.chrom.sizes -y /ifs/data/lehmannlab/couxr01/2/juicer/restriction_sites/dm6_DpnII.txt -C 22500000

Sequenced Read Pairs: 582,455,100

Normal Paired: 473,366,678 (81.27%)

Chimeric Paired: 46,517 (0.01%)

Chimeric Ambiguous: 304,712 (0.05%)

Unmapped: 108,737,193 (18.67%)

Ligation Motif Present: 270,587,500 (46.46%)

Alignable (Normal+Chimeric Paired): 473,413,195 (81.28%)

Unique Reads: 220,188,897 (37.80%)

PCR Duplicates: 250,812,721 (43.06%)

Optical Duplicates: 932,004 (0.16%)

Library Complexity Estimate: 264,992,336

Intra-fragment Reads: 9,871,762 (1.69% / 4.48%)

Below MAPQ Threshold: 135,181,270 (23.21% / 61.39%)

Hi-C Contacts: 75,135,865 (12.90% / 34.12%)

Ligation Motif Present: 30,950,965 (5.31% / 14.06%)

3' Bias (Long Range): 82% - 18%

Pair Type %(L-I-O-R): 25% - 25% - 25% - 25%

Inter-chromosomal: 18,437,030 (3.17% / 8.37%)

Intra-chromosomal: 56,698,835 (9.73% / 25.75%)

Short Range (<20Kb): 26,233,809 (4.50% / 11.91%)

Long Range (>20Kb): 30,464,761 (5.23% / 13.84%)

595 domains called

and I get

***! Error! The sorted file and dups/no dups files do not add up, or were empty. Merge or dedupping likely failed, restart pipeline with -S merge or -S dedup

Dups don't add up. Check /ifs/data/lehmannlab/couxr01/2/aligned for results
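The message above concerns the sorted and deduplicated files, which are not shown here, but a related consistency check can be made from the reported stats alone: whether Unique Reads, PCR Duplicates and Optical Duplicates sum to the Alignable count. The sketch below uses the numbers pasted in this message for the two split sizes; it is an illustration, not Juicer's own check, and `DedupStats` and `dedup_gap` are hypothetical names.

```python
from typing import NamedTuple

class DedupStats(NamedTuple):
    alignable: int      # Alignable (Normal+Chimeric Paired)
    unique: int         # Unique Reads
    pcr_dups: int       # PCR Duplicates
    optical_dups: int   # Optical Duplicates

def dedup_gap(stats: DedupStats) -> int:
    """Alignable pairs not accounted for by the three dedup categories."""
    return stats.alignable - (stats.unique + stats.pcr_dups + stats.optical_dups)

# Keyed by splitsize; values copied from the two reports in this message.
RUNS = {
    45_000_000: DedupStats(250_514_233, 142_892_308, 106_909_089, 712_836),
    22_500_000: DedupStats(473_413_195, 220_188_897, 250_812_721, 932_004),
}

if __name__ == "__main__":
    for splitsize, stats in RUNS.items():
        print(f"splitsize {splitsize}: gap = {dedup_gap(stats):,}")
```

With the posted numbers the gap is 0 for the 45000000 run and 1,479,573 for the 22500000 run, so the run that produced the dedup warning is also the one whose duplicate categories fall short of the alignable total.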

I understand from this post that this can be due to BWA issues. However, in Helen's case she only got small variations in the number of read pairs, whereas I get almost double the read pairs and a difference of about 15M contacts. Have you ever seen something like this, and if so, do you have any idea what could explain it?

Thanks

--

RX Coux

Muhammad Saad Shamim

Oct 4, 2018, 6:54:07 PM

to rx....@gmail.com, 3d-ge...@googlegroups.com

Can you check the debug folder for errors in the alignment jobs?

Most likely several jobs with the larger splitsize failed to finish aligning in the time limit, hence more reads with the smaller split size.
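One way to do this check is to scan every file in the debug folder for error-like keywords. The sketch below is a generic text search and not part of Juicer; the keyword list and the `scan_debug_dir` name are assumptions, and the directory passed on the command line should be the run's own debug folder.

```python
import sys
from pathlib import Path

# Assumed keywords; adjust to whatever the scheduler and aligner actually emit.
KEYWORDS = ("error", "exception", "killed", "cannot allocate memory")

def scan_debug_dir(debug_dir, keywords=KEYWORDS):
    """Return (file name, line number, line) for each line containing a keyword."""
    hits = []
    for path in sorted(Path(debug_dir).iterdir()):
        if not path.is_file():
            continue
        # errors="replace" avoids crashing on stray non-UTF-8 bytes in logs.
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                lowered = line.lower()
                if any(word in lowered for word in keywords):
                    hits.append((path.name, number, line.rstrip()))
    return hits

if __name__ == "__main__":
    target = Path(sys.argv[1] if len(sys.argv) > 1 else "debug")
    if not target.is_dir():
        print(f"{target} is not a directory")
    else:
        for name, number, line in scan_debug_dir(target):
            print(f"{name}:{number}: {line}")
```

An empty result only means none of the assumed keywords appear; a job killed by the scheduler for exceeding its time limit may leave no message in its own log at all.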

Best,

- Muhammad Saad Shamim

Remi Coux

Oct 5, 2018, 4:40:55 AM

to Muhammad Saad Shamim, 3d-ge...@googlegroups.com

Hi, the only errors are in finalcheck.

I reran with -C 33750000 and got read and contact counts in between those for 45000000 and 22500000:

Experiment description: Juicer version 1.5.6; BWA 0.7.7-r441; 1 threads; splitsize 33750000; openjdk version "1.8.0_144"; Juicer Tools Version 1.7.6; /ifs/data/lehmannlab/couxr01/juicer/scripts/juicer.sh -g dm6 -s DpnII -z /ifs/data/lehmannlab/couxr01/juicer/references/dm6.fa -p /ifs/data/lehmannlab/couxr01/juicer/restriction_sites/dm6.chrom.sizes -y /ifs/data/lehmannlab/couxr01/juicer/restriction_sites/dm6_DpnII.txt -C 33750000

Sequenced Read Pairs: 411,763,433

Normal Paired: 320,653,357 (77.87%)

Chimeric Paired: 31,593 (0.01%)

Chimeric Ambiguous: 17,495,842 (4.25%)

Unmapped: 73,582,610 (17.87%)

Ligation Motif Present: 270,587,500 (65.71%)

Alignable (Normal+Chimeric Paired): 320,684,950 (77.88%)

Unique Reads: 169,688,696 (41.21%)

PCR Duplicates: 150,207,488 (36.48%)

Optical Duplicates: 788,766 (0.19%)

Library Complexity Estimate: 222,558,484

Intra-fragment Reads: 7,798,990 (1.89% / 4.60%)

Below MAPQ Threshold: 96,131,962 (23.35% / 56.65%)

Hi-C Contacts: 65,757,744 (15.97% / 38.75%)

Ligation Motif Present: 27,333,504 (6.64% / 16.11%)

3' Bias (Long Range): 82% - 18%

Pair Type %(L-I-O-R): 25% - 25% - 25% - 25%

Inter-chromosomal: 16,148,224 (3.92% / 9.52%)

Intra-chromosomal: 49,609,520 (12.05% / 29.24%)

Short Range (<20Kb): 22,847,795 (5.55% / 13.46%)

Long Range (>20Kb): 26,761,504 (6.50% / 15.77%)

578 domains called

debug/finalcheck-a1538689117.out:***! Error! The statistics do not add up. Alignment likely failed to complete on one or more files. Run relaunch_prep.sh
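To put the three split sizes side by side: assuming the splitsize value counts FASTQ lines and each read occupies four lines (the thread does not state the unit, so this is an assumption), the reads per chunk and the number of chunks per file follow from simple arithmetic. The function names and the 300M total in the usage loop are hypothetical.

```python
LINES_PER_READ = 4  # standard FASTQ record: header, sequence, '+', qualities

def reads_per_chunk(splitsize_lines):
    """Reads in one chunk, assuming splitsize is a FASTQ line count."""
    if splitsize_lines % LINES_PER_READ:
        raise ValueError("splitsize must be a multiple of 4 lines")
    return splitsize_lines // LINES_PER_READ

def chunk_count(total_reads, splitsize_lines):
    """Chunks needed to hold total_reads (integer ceiling division)."""
    per_chunk = reads_per_chunk(splitsize_lines)
    return -(-total_reads // per_chunk)

if __name__ == "__main__":
    total_reads = 300_000_000  # hypothetical total, for illustration only
    for splitsize in (45_000_000, 33_750_000, 22_500_000):
        print(splitsize, reads_per_chunk(splitsize),
              chunk_count(total_reads, splitsize))
```

Under that assumption, halving the split size doubles the number of alignment jobs while roughly halving the work in each one, which matters if individual jobs are running into a scheduler time limit.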

--

RX Coux

Muhammad Saad Shamim

Oct 5, 2018, 8:21:58 AM

to Remi Coux, 3d-ge...@googlegroups.com

Did you try running relaunch_prep.sh, as the finalcheck file says, on the directory where the stats didn't add up? It'll figure out which alignments didn't run, if that's the case.
