Issue with Juicer Tools Pre: "Unable to find All-All or All-All"

119 views
Skip to first unread message

Alice

unread,
Apr 12, 2024, 9:26:41 AM4/12/24
to 3D Genomics
Hi there,

Thank you for spending time developing the nice JUICER suite.

I am currently adapting the Juicer pipeline for it to work on the HPC cluster used by my lab.
In order to do that, I am first re-running the SLURM juicer.sh script "manually" (i.e., testing each command that are in the SLURM scripts, step-by-step) on a single computer, using the small test dataset HIC003 suggested in the Juicer Wiki.
I am currently stuck at the the .hic file generation step, which uses juicer_tools pre.
My input (launched from the relevant 'aligned/' directory) is as follows:
'[JUICER_DIRECTORY]/scripts/juicer_tools pre -n -s inter.txt -g  inter_hists.m -q 1 -r 2500000,1000000,500000,250000,100000,50000,25000,10000,5000,2000,1000,500,200,100 -f [JUICER_DIRECTORY]/restriction_sites/hg38_Arima.txt --threads 20 -i merged1_index.txt -t HIC_tmp merged1.txt inter.hic [JUICER_DIRECTORY]/references/hg38/hg38.fa > tmp'
(see below for further info about the input files and my set-up)

This gives me the following error on the terminal:
"Unable to find All-All  or  All-All
MNDIndex is empty or could not be read
"
while the first ten lines of the output tmp file look like:
"WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
Skipping >chr1
Skipping NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
Skipping NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
Skipping NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
Skipping NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
Skipping NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
Skipping NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
Skipping NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
Skipping NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

(...)"
and its last ten lines look like:
"(...)
Skipping tttagacctttaaggtgagtgttacagcactgaaagatgttatgtccaga
Skipping gttttttccttcagatatttccagagtttcttccatctcgcaggttcatg
Skipping gtcttggtcacttcaagaatgaagctgcagaccttagtggtgagggttac
Skipping agcacttaaaggtgttatgtccagagttttttcctacagatgtgtccaga
Skipping gtttcttccttctggcgggttcatggtcttgctcacttcaagaatgaagc
Skipping tgcagaccttagtgttgagtgttacagcacttaaaggtattaagcccaga
Skipping gtttgttacttctgatgtgtccagaatttcttccttctggcaggttcatg
Skipping gtcttgctcacttcaagaatgaagctgcagacatttacgg
Using 20 CPU thread(s) for primary task
Using 10 CPU thread(s) for secondary task
"

I did take a look inside juicer_tools.jar, and it seems that the error printed on the terminal is generated by MTIndexHandler (see hic/tools/utils/original/MTIndexHandler.class). 
However, given that I have absolutely zero knowledge in Java language, I was not able to debug any further than that.
I guess it is correlated to the fact that, according to the output tmp file, the whole genome is skipped by the process?
Do you have any idea what could be the issue in my situation?

I really wish to implement the use of JUICER in our lab, but I have to admit that I am completely stuck right now...
Any help would be GREATLY appreciated!!
Thanks in advance,

Alice


USEFUL INFO:

My set-up:
- OS: Debian GNU/Linux 11 (bullseye)
- GPU: ASPEED Technology ASPEED Graphics Family version 10 (1a03:2000)
- GNU coreutils: version 8.32-4+b1
- JAVA: openjdk 11.0.1-internal
- BWA: version: 0.7.17-r1188 (installed via bioconda; shouldn't matter anyway for this step)
- JUICER: version 2.0 (obtained with 'git clone https://github.com/theaidenlab/juicer.git')
- JUICER TOOLS: version 2.20.00 (downloaded with 'wget https://github.com/aidenlab/Juicebox/releases/download/v2.20.00/juicer_tools.2.20.00.jar')

My input files:
- inter.txt: this was generated by stats_sub.awk and juicer_tools statistics; the whole file looks like:
"Read type: Paired End
Sequenced Read Pairs:  2,348,578
No chimera found: 20,742 (0.88%)
 One or both reads unmapped: 20,742 (0.88%)
2 alignments: 2,246,797 (95.67%)
 2 alignments (A...B): 1,864,748 (79.40%)
 2 alignments (A1...A2B; A1B2...B1A2): 382,049 (16.27%)
3 or more alignments: 81,039 (3.45%)
Ligation Motif Present: 1,429,457 (60.86%)
Average insert size: 393.27
Total Unique: 2,242,768 (99.82%, 95.49%)
Total Duplicates: 4,029 (0.18%, 0.17%)
Library Complexity Estimate*: 625,721,029
Intra-fragment Reads: N/A
Below MAPQ Threshold: 298,572 (12.71% / 13.31%)
Hi-C Contacts: 1,944,196 (82.78% / 86.69%)
 3' Bias (Long Range): 50% - 50%
 Pair Type %(L-I-O-R): 25% - 25% - 25% - 25%
 L-I-O-R Convergence: 2310
Inter-chromosomal: 365,693 (15.57% / 16.31%)
Intra-chromosomal: 1,578,503 (67.21% / 70.38%)
Short Range (<20Kb):
  <500BP: 282,220 (12.02% / 12.58%)
  500BP-5kB: 94,908 (4.04% / 4.23%)
  5kB-20kB: 141,034 (6.01% / 6.29%)
Long Range (>20Kb): 1,060,341 (45.15% / 47.28%)
"

- inter_hists.m: this was also generated by juicer_tools statistics (I think?) and the file looks like:
"A = [
1 0 7 1 0 1 2 0 0 0 1 4 0 0 0 1 0 0 0 0 1 1 1 1 0 1 0 1 0 1 0 1 3 0 0 0 0 1 0 0 1 0 0 2 2 7 0 1 2 0 3 1 1 1 1 0 0 2 0 1 0 0 0 0 1 1 2 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 5 0 0 1 0 1 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
(...) 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
];
B = [
0 0 0

(...)
("0 0 0" repeated 200 times)
(...)

0 0 0

];
D = [
317 265 17 14
82 39 5 3
137 69 6 11
182 88 11 10
187 99 11 8
275 99 20 17
457 143 26 20
647 162 20 27
1184 224 25 24
2304 348 26 25
4105 512 14 18
8542 506 47 52
17544 15 51 59

(...)
0 0 0 0

];x = [
10 12 15 19 23 28 35 43 53 66 81 100 123 152 187 231 285 351 433 534 658 811 1000 1233 1520 1874 2310 2848 3511 4329 5337 6579 8111 10000 12328 15199 18738 23101 28480 35112 43288 53367 65793 81113 100000 123285 151991 187382 231013 284804 351119 432876 533670 657933 811131 1000000 1232847 1519911 1873817 2310130 2848036 3511192 4328761 5336699 6579332 8111308 10000000 12328467 15199111 18738174 23101297 28480359 35111917 43287613 53366992 65793322 81113083 100000000 123284674 151991108 187381742 231012970 284803587 351119173 432876128 533669923 657933225 811130831 1000000000 1232846739 1519911083 1873817423 2310129700 2848035868 3511191734 4328761281 5336699231 6579332247 8111308308 10000000000
];
"

- hg38_Arima.txt: this was generated with 'python [JUICER_DIRECTORY]/misc/generate_site_positions.py Arima hg38 [JUICER_DIRECTORY]/references/hg38/hg38.fa', as instructed in the Juicer Wiki (the real dataset that I will have to analyze later was generated by Arima digestion, that is why I am using this); an abbreviated snippet of this file looks like:
"chr1 11160 11507 11522 11685 12411 12461 12686 12829 12910 13244 13315 13420 (...)
chr10 10444 10550 10693 10867 10962 12128 12427 12611 12706 12779 13273 13523 (...)
chr10_GL383545v1_alt 44 49 428 1114 1202 1397 1529 1608 1957 2177 2196 2595 (...)
chr10_GL383546v1_alt 776 825 1280 1437 1729 1903 2077 2381 2610 2814 2830 2988 (...)
chr10_KI270824v1_alt 5 31 47 109 125 174 184 197 207 249 259 272 (...)
chr10_KI270825v1_alt 140 217 351 448 586 627 774 881 1223 1396 1412 1530 (...)
chr11 60245 60536 60814 60823 61117 61300 61879 62228 62266 62579 62690 63017 (...)
chr11_GL383547v1_alt 219 335 461 1072 1132 1691 1960 2474 2781 2892 3026 3261 (...)
chr11_JH159136v1_alt 456 623 697 1030 1124 1364 1582 1707 1965 2560 2615 3160 (...)
chr11_JH159137v1_alt 212 450 471 545 1176 1449 1492 1564 1578 1618 1683 2396 (...)
(...)"

- merged1_index.txt: this was generated by
'[JUICER_DIRECTORY]/scripts/index_by_chr.awk merged1.txt 500000 > merged1_index.txt' and the first 10 lines look like:
"chr10_GL383545v1_alt-chr10_GL383545v1_alt,0,0,62
chr10_GL383545v1_alt-chr11,0,62,51
chr10_GL383545v1_alt-chr20,0,113,52
chr10_GL383545v1_alt-chr9,0,165,50
chr10_GL383546v1_alt-chr10_GL383546v1_alt,0,215,260
chr10_GL383546v1_alt-chr11,0,475,51
chr10_GL383546v1_alt-chr15,0,526,52
chr10_GL383546v1_alt-chrX,0,578,52
chr10-chr10_GL383545v1_alt,0,630,405
chr10-chr10_GL383546v1_alt,0,1035,575

(...)"

- merged1.txt: this was generated by 'samtools view -F 1024 -O sam -@ 20 merged_dedup.sam | awk -v mapq=1 -f [JUICER_DIRECTORY]/scripts/sam_to_pre.awk'; the first 10 lines look like:
"0 chr10_GL383545v1_alt 57538 0 0 chr10_GL383545v1_alt 61540 1
0 chr10_GL383545v1_alt 36528 0 0 chr11 115254730 1
16 chr10_GL383545v1_alt 74747 0 16 chr20 39894595 1
16 chr10_GL383545v1_alt 74957 0 0 chr9 74022902 1
0 chr10_GL383546v1_alt 161260 0 16 chr10_GL383546v1_alt 173032 1
0 chr10_GL383546v1_alt 172060 0 16 chr10_GL383546v1_alt 172207 1
0 chr10_GL383546v1_alt 175382 0 16 chr10_GL383546v1_alt 175636 1
0 chr10_GL383546v1_alt 182752 0 16 chr10_GL383546v1_alt 182979 1
0 chr10_GL383546v1_alt 181070 0 0 chr11 43909533 1
16 chr10_GL383546v1_alt 173120 0 0 chr15 76476336 1

(...)"

- hg38.fa: latest Homo sapiens genome assembly (GRCh38.p14); the first and last 10 lines look like:
">chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

(...)
acagcacttaaaggtgttatgtccagagtttgttcctgcagatgtgtcca
gtttcttcctcctggcaagttcatgctcttgctcacttcaagaatgaaac
tttagacctttaaggtgagtgttacagcactgaaagatgttatgtccaga
gttttttccttcagatatttccagagtttcttccatctcgcaggttcatg
gtcttggtcacttcaagaatgaagctgcagaccttagtggtgagggttac
agcacttaaaggtgttatgtccagagttttttcctacagatgtgtccaga
gtttcttccttctggcgggttcatggtcttgctcacttcaagaatgaagc
tgcagaccttagtgttgagtgttacagcacttaaaggtattaagcccaga
gtttgttacttctgatgtgtccagaatttcttccttctggcaggttcatg
gtcttgctcacttcaagaatgaagctgcagacatttacgg
"

NOTE: The same folder also contains the corresponding BWA index files and a hg38.chrom.sizes file that was generated with "awk 'BEGIN{OFS="\t"}{print $1, $NF}' [JUICER_DIRECTORY]/restriction_sites/hg38_Arima.txt" (as instructed in the Juicer Wiki) and which looks like:
"chr1    248956422
chr10   133797422
chr10_GL383545v1_alt    179254
chr10_GL383546v1_alt    309802
chr10_KI270824v1_alt    181496
chr10_KI270825v1_alt    188315
chr11   135086622
chr11_GL383547v1_alt    154407
chr11_JH159136v1_alt    200998
chr11_JH159137v1_alt    191409

(...)
chrUn_KI270754v1        40191
chrUn_KI270755v1        36723
chrUn_KI270756v1        79590
chrUn_KI270757v1        71251
chrX    156040895
chrX_KI270880v1_alt     284869
chrX_KI270881v1_alt     144206
chrX_KI270913v1_alt     274009
chrY    57227415
chrY_KI270740v1_random  37240
"

Alice

unread,
Apr 16, 2024, 2:04:24 AM4/16/24
to 3D Genomics

Hi again,

Since my previous message, I have actually managed to find why I was having such an error.

In the SLURM version of the "juicer.sh" script, the code line invoking Juicer Tools Pre is as follows (for MAPQ>1 reads):
"time ${juiceDir}/scripts/juicer_tools pre -n -s $outputdir/inter.txt -g $outputdir/inter_hists.m -q 1 $resstr $fragstr $threadHicString $outputdir/merged1.txt $outputdir/inter.hic $genomePath"

I mistakenly thought that the $genomePath variable referred to the path to the whole genome file ([JUICER_DIRECTORY]/references/hg38/hg38.fa in my case).
However, what should be put instead is the path to the chromosome sizes file ([JUICER_DIRECTORY]/references/hg38/hg38.chrom.sizes in my case), as indicated in the Juicebox wiki and which I had failed to notice.

In the end, I was able to create my .hic file with the following command:
"[JUICER_DIRECTORY]/scripts/juicer_tools pre -n -s inter.txt -g  inter_hists.m -q 1 -r 2500000,1000000,500000,250000,100000,50000,25000,10000,5000,2000,1000,500,200,100 -f [JUICER_DIRECTORY]/restriction_sites/hg38_Arima.txt --threads 20 -i merged1_index.txt -t HIC_tmp merged1.txt inter.hic [JUICER_DIRECTORY]/references/hg38/hg38.chrom.sizes"
 
I apologize for the trouble!

 Alice
Reply all
Reply to author
Forward
0 new messages