Running juicer on test data

1,121 views
Skip to first unread message

Mark Maienschein-Cline

unread,
Mar 14, 2019, 1:01:50 PM3/14/19
to 3D Genomics
I was running juicer on the test data described on the Quick Start guide on the wiki (https://github.com/aidenlab/juicer/wiki/Installation#quick-start), following the steps under "Example with cluster version and downloading reference files". However, I received an error that no contacts were found in the practice data:

java.lang.RuntimeException: No reads in Hi-C contact matrices. This could be because the MAPQ filter is set too high (-q) or because all reads map to the same fragment.

        at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.mergeAndWriteBlocks(Preprocessor.java:1650)

        at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.access$000(Preprocessor.java:1419)

        at juicebox.tools.utils.original.Preprocessor.writeMatrix(Preprocessor.java:832)

        at juicebox.tools.utils.original.Preprocessor.writeBody(Preprocessor.java:582)

        at juicebox.tools.utils.original.Preprocessor.preprocess(Preprocessor.java:346)

        at juicebox.tools.clt.old.PreProcessing.run(PreProcessing.java:116)

        at juicebox.tools.HiCTools.main(HiCTools.java:96)


Is this an error on my end, or is it because the practice data is too low-coverage to work with default parameters? If the latter, which parameters would be best to change? I did not see a -q flag in the juicer.sh wrapper script.

Thanks,
Mark

Neva Durand

unread,
Mar 14, 2019, 1:05:38 PM3/14/19
to Mark Maienschein-Cline, 3D Genomics
No, you should have data. What system are you running under?

--
You received this message because you are subscribed to the Google Groups "3D Genomics" group.
To unsubscribe from this group and stop receiving emails from it, send an email to 3d-genomics...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/3d-genomics/d295490e-b0a5-4cc7-a07f-ee06175f36c4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
Neva Cherniavsky Durand, Ph.D.
Staff Scientist, Aiden Lab

Mark Maienschein-Cline

unread,
Mar 14, 2019, 1:22:38 PM3/14/19
to Neva Durand, 3D Genomics
Hi,

I’m running on a campus HPC resource at University of Illinois at Chicago, which is running CentOS. I’m also using Java 1.8.0_73.

Thanks,
Mark

Neva Durand

unread,
Mar 14, 2019, 1:23:44 PM3/14/19
to Mark Maienschein-Cline, 3D Genomics
Are you using the CPU version or a cluster version? Was this the only error you saw? What files do you see?

Mark Maienschein-Cline

unread,
Mar 14, 2019, 1:29:03 PM3/14/19
to Neva Durand, 3D Genomics
Hi,

I’m using the CPU version, but running on an interactive node on screen.

Here’s the full list of files in the output directory that I get:

-rw-rw----+ 1 mmaiensc rrc_shared  164323215 Jul 24  2016 splits/HIC003_S2_L001_R1_001.fastq.gz
-rw-rw----+ 1 mmaiensc rrc_shared  164323215 Jul 24  2016 fastq/HIC003_S2_L001_R1_001.fastq.gz
-rw-rw----+ 1 mmaiensc rrc_shared  174944660 Jul 24  2016 splits/HIC003_S2_L001_R2_001.fastq.gz
-rw-rw----+ 1 mmaiensc rrc_shared  174944660 Jul 24  2016 fastq/HIC003_S2_L001_R2_001.fastq.gz
-rw-rw----+ 1 mmaiensc rrc_shared        580 Mar 13 16:09 aligned/header
-rw-rw----+ 1 mmaiensc rrc_shared          8 Mar 13 16:09 splits/HIC003_S2_L001_001.fastq.gz_linecount.txt
-rw-rw----+ 1 mmaiensc rrc_shared 1599872900 Mar 13 16:30 splits/HIC003_S2_L001_001.fastq.gz.sam
-rw-rw----+ 1 mmaiensc rrc_shared         42 Mar 13 16:31 splits/HIC003_S2_L001_001.fastq.gz_norm.txt.res.txt
-rw-rw----+ 1 mmaiensc rrc_shared   68424725 Mar 13 16:31 splits/HIC003_S2_L001_001.fastq.gz_abnorm.sam
-rw-rw----+ 1 mmaiensc rrc_shared   12713214 Mar 13 16:31 splits/HIC003_S2_L001_001.fastq.gz_unmapped.sam
-rw-rw----+ 1 mmaiensc rrc_shared  791011665 Mar 13 16:33 splits/HIC003_S2_L001_001.fastq.gz.sort.txt
-rw-rw----+ 1 mmaiensc rrc_shared  791011665 Mar 13 16:33 aligned/merged_sort.txt
-rw-rw----+ 1 mmaiensc rrc_shared    1039366 Mar 13 16:33 aligned/opt_dups.txt
-rw-rw----+ 1 mmaiensc rrc_shared    1169046 Mar 13 16:33 aligned/dups.txt
-rw-rw----+ 1 mmaiensc rrc_shared  788803253 Mar 13 16:33 aligned/merged_nodups.txt
-rw-rw----+ 1 mmaiensc rrc_shared        346 Mar 13 16:33 aligned/stats_dups.txt
-rw-rw----+ 1 mmaiensc rrc_shared       7143 Mar 13 16:33 aligned/stats_dups_hists.m
-rw-rw----+ 1 mmaiensc rrc_shared       1857 Mar 13 16:34 aligned/inter.txt
-rw-rw----+ 1 mmaiensc rrc_shared      10027 Mar 13 16:34 aligned/inter_hists.m
-rw-rw----+ 1 mmaiensc rrc_shared   68424725 Mar 13 16:34 aligned/abnormal.sam
-rw-rw----+ 1 mmaiensc rrc_shared   12713214 Mar 13 16:34 aligned/unmapped.sam
-rw-rw----+ 1 mmaiensc rrc_shared    3242857 Mar 13 16:34 aligned/collisions.txt
-rw-rw----+ 1 mmaiensc rrc_shared      12294 Mar 13 16:34 aligned/inter.hic
-rw-rw----+ 1 mmaiensc rrc_shared       1858 Mar 13 16:35 aligned/inter_30.txt
-rw-rw----+ 1 mmaiensc rrc_shared       9681 Mar 13 16:35 aligned/inter_30_hists.m
-rw-rw----+ 1 mmaiensc rrc_shared      11949 Mar 13 16:35 aligned/inter_30.hic

aligned/inter_30_contact_domains.txt:
total 0

Thanks,
Mark

Neva Durand

unread,
Mar 14, 2019, 1:31:44 PM3/14/19
to Mark Maienschein-Cline, 3D Genomics
Could you paste the inter.txt file?

Mark Maienschein-Cline

unread,
Mar 14, 2019, 1:33:09 PM3/14/19
to Neva Durand, 3D Genomics
Sure:

Experiment description: Juicer version 1.5.6; BWA 0.7.15-r1140; 4 threads; java version "1.8.0_73"; /home/mmaiensc/clustcrilab/tools/juicer/juicer/scripts/juicer.sh -D /home/mmaiensc/clustcrilab/tools/juicer/juicer -z /mnt/store3/clustcrilab/mark_projects/2019.03-akenter/hic/test_data/references/Homo_sapiens_assembly19.fasta -p /mnt/store3/clustcrilab/mark_projects/2019.03-akenter/hic/test_data/references/Homo_sapiens_assembly19.fasta.fai -y /mnt/store3/clustcrilab/mark_projects/2019.03-akenter/hic/test_data/restriction_sites/hg19_MboI.txt -t 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
Sequenced Read Pairs:  2,348,578
 Normal Paired: 1,865,156 (79.42%)
 Chimeric Paired: 387,270 (16.49%)
 Chimeric Ambiguous: 74,691 (3.18%)
 Unmapped: 21,461 (0.91%)
 Ligation Motif Present: 895,318 (38.12%)
Alignable (Normal+Chimeric Paired): 2,252,426 (95.91%)
Unique Reads: 2,246,197 (95.64%)
PCR Duplicates: 3,284 (0.14%)
Optical Duplicates: 2,945 (0.13%)
Library Complexity Estimate: 769,677,025
Intra-fragment Reads: 40,541 (1.73% / 1.80%)
Below MAPQ Threshold: 232,991 (9.92% / 10.37%)
Hi-C Contacts: 1,972,665 (83.99% / 87.82%)
 Ligation Motif Present: 571,575  (24.34% / 25.45%)
 3' Bias (Long Range): 70% - 30%
 Pair Type %(L-I-O-R): 25% - 25% - 25% - 25%
Inter-chromosomal: 389,272  (16.57% / 17.33%)
Intra-chromosomal: 1,583,393  (67.42% / 70.49%)
Short Range (<20Kb): 493,483  (21.01% / 21.97%)
Long Range (>20Kb): 1,089,908  (46.41% / 48.52%)

Neva Durand

unread,
Mar 14, 2019, 1:38:15 PM3/14/19
to Mark Maienschein-Cline, 3D Genomics
These stats look right. I don’t know why the hic file creation didn’t work - did you download the latest Juicer tools jar and soft link it to juicer_tools.jar?

I would run this command after being sure the jar is up to date:

juicer_tools pre -f <restriction site file> -s inter.txt -g inter_hists.m -q 1 merged_nodups.txt inter.hic hg19

Mark Maienschein-Cline

unread,
Mar 14, 2019, 2:11:11 PM3/14/19
to Neva Durand, 3D Genomics
Hi,

I downloaded the tools yesterday, so should be up to date. Running that command worked fine, output to terminal was:

~/clustcrilab/tools/juicer/juicer/scripts/common/juicer_tools pre -f ../../restriction_sites/hg19_MboI.txt -s inter.txt -g inter_hists.m -q 1 merged_nodups.txt inter.hic hg19
Start preprocess
Writing header
Writing body
......................................................................................................................................................................................................................................................................................................................................
Writing footer

Finished preprocess

Calculating norms for zoom BP_2500000
Calculating norms for zoom BP_1000000
Calculating norms for zoom BP_500000
Calculating norms for zoom BP_250000
Calculating norms for zoom BP_100000
Calculating norms for zoom BP_50000
Calculating norms for zoom BP_25000
Calculating norms for zoom BP_10000
Calculating norms for zoom BP_5000
Calculating norms for zoom FRAG_500
Calculating norms for zoom FRAG_200
Calculating norms for zoom FRAG_100
Calculating norms for zoom FRAG_50
Calculating norms for zoom FRAG_20
Calculating norms for zoom FRAG_5
Calculating norms for zoom FRAG_2
Calculating norms for zoom FRAG_1
Writing expected
Writing norms
Finished writing norms


However, from the usage in juicer.sh, it seems that the “hg19” parameter should be a path to the genome file (I provided mine with the -z flag), right?

Thanks,
Mark

Neva Durand

unread,
Mar 14, 2019, 2:35:21 PM3/14/19
to Mark Maienschein-Cline, 3D Genomics
No, the -z flag is something else (the reference file, not the genome ID). That argument is for the chrom.sizes file; for standard genomes, you can send in the ID. It would be the -p flag. I'll review the documentation to ensure this is clearer.

rem...@gmail.com

unread,
Mar 15, 2019, 7:17:05 AM3/15/19
to Mark Maienschein-Cline, 3D Genomics

Hi Mark,

I found the same result.  There is a second and higher depth dataset hosted on the site that I used without issue.

BTW. Check out my new tutorial on juicer.  By far not complete, but illustrates some of my issues and workarounds. 

https://isugenomics.github.io/bioinformatics-workbook/dataAnalysis/GenomeAssembly/Hybrid/Scaffolding_with_HiC_Juicer.html

 

Rick

--

Neva Durand

unread,
Mar 15, 2019, 7:23:16 AM3/15/19
to Rick Masonbrink, Mark Maienschein-Cline, 3D Genomics
Thanks Rick.

We can very easily add a flag to get rid of "wobble", which will do what you already do with the dedup workaround. Let me know if this is of interest.

I've investigated the problem where dups cause memory issues extensively recently and it's almost entirely due to MAPQ 0 reads aligning within wobble. The next version of Juicer (Juicer 2.0, being developed in conjunction with ENCODE) will shunt these reads off earlier in the process, which eliminates the issue; however, for assembly purposes there might be a desire to keep the MAPQ 0 reads.


For more options, visit https://groups.google.com/d/optout.

rem...@gmail.com

unread,
Mar 15, 2019, 10:57:50 AM3/15/19
to Neva Durand, Mark Maienschein-Cline, 3D Genomics

Hi Neva,

 

Yes, I ran into the same issue with eliminating these reads for assembly.  It does cause a problem for repetitive scaffold assembly, a serious issue for people like me that study centromeres.  My pipeline still has a problem, in that I still have 3 of my 64 jobs still deduping.  Perhaps just eliminating 90% of the reads that go to these regions would be better than 100%, and still accelerate the process.

You all are the experts, and I would appreciate any comments or suggestions on the tutorial. There may be better ways to do things, but I did not see them.

 

I also had a horrible time trying to figure out how to use your Slurm scripts, which are configured differently from our HPC.  Is there a way to make juicer just print out all of the jobs line by line , so I can submit each to our SLURM system more easily?  Thanks

Sorry Mark, wasn’t intending to hijack your thread.

 

Rick

Neva Durand

unread,
Mar 15, 2019, 11:27:18 AM3/15/19
to Rick Masonbrink, Mark Maienschein-Cline, 3D Genomics
I will add a flag that dedups with no wobble.

For SLURM: Look for lines starting with "jid. E.g. line 414.
Replace jid=`sbatch <<- HEADER | egrep -o -e "\b[0-9]+$"
With 

cat << HEADER >> tmp1 


This will write the job to tmp1. You can do this with every "jid=" (using different output file names for each one).

The timing on job submissions can certainly be tricky; this was the biggest engineering challenge for us. 

For the HIC003 case I'm still surprised there was a problem (seeing as the merged_nodups file was fine). 

Neva Durand

unread,
Mar 15, 2019, 2:49:13 PM3/15/19
to Rick Masonbrink, Mark Maienschein-Cline, 3D Genomics
One more note on SLURM - the jids that get passed to the "hold" won't be there, so you will have to manually replace those in the tmp scripts.

Neva Durand

unread,
Aug 8, 2019, 12:54:10 PM8/8/19
to 3D Genomics
This is now done in Juicer (SLURM)

Flag -j for "just exact matches"; this will only eliminate exact match duplicates. Overall this flag is not recommended. However, if you find your jobs are often getting stuck at the dedup phase, it can be because of low complexity or low mapping quality and this flag will allow the jobs to finish much faster. You will still be left with near-duplicates in your library - so use caution when interpreting results. In particular note that near-duplicates are usually machine errors, not true biological results, and thus ought to be removed.

To unsubscribe from this group and stop receiving emails from it, send an email to 3d-genomics+unsubscribe@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "3D Genomics" group.

To unsubscribe from this group and stop receiving emails from it, send an email to 3d-genomics+unsubscribe@googlegroups.com.


 

--

Neva Cherniavsky Durand, Ph.D.

Staff Scientist, Aiden Lab

--
You received this message because you are subscribed to the Google Groups "3D Genomics" group.

To unsubscribe from this group and stop receiving emails from it, send an email to 3d-genomics+unsubscribe@googlegroups.com.

Reply all
Reply to author
Forward
0 new messages