run juicer.sh with slurm,spack command not found


Ao Qo

Jan 16, 2022, 7:38:22 AM
to 3D Genomics
Hello, everyone,
I am running Juicer with SLURM. I copied the scripts folder from the SLURM directory into my working directory, and I prepared the reference genome directory (the genome plus its BWA index files) and the restriction_sites folder. My command is: ./scripts/juicer.sh -g Ma6.genome -d /data/01/user157/HIFI/Hic-anchor/Ma6/juicer/work/hic_data -s MboI -p ./restriction_sites/Ma6.genome.chrom.sizes -y ./restriction_sites/Ma6.genome_MboI.txt -z ./references/M.aspalax.6.bp.p_ctg.fasta -D /data/01/user157/HIFI/Hic-anchor/Ma6/juicer/work -t 80 -q pNormal -l pNormal. But the logs in the debug folder report "spack: command not found". Looking at juicer.sh, I think I need to change the software paths; however, I am not a root user, so I cannot use module to load these tools. Could I instead set the paths directly, e.g. load_bwa="export PATH=/data/01/user157/software/bwa:$PATH", load_java="export PATH=/usr/bin/:$PATH", and load_gpu="export PATH=/usr/bin/:$PATH"? (The attached screenshot shows the locations of bwa, java, and the GPU toolkit.) Should I also modify other parameters in juicer.sh, such as queue_time, long_queue_time, and memory? I am new to Juicer and have no experience with it; my genome is about 3 Gb. Could you give me any suggestions?
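Written out, the edits I have in mind for juicer.sh would be roughly:
```
# Proposed replacements for the spack load commands in juicer.sh
# (the paths are the ones on my cluster; \$PATH is escaped so it is
# expanded when the job runs, not when juicer.sh itself is parsed):
load_bwa="export PATH=/data/01/user157/software/bwa:\$PATH"
load_java="export PATH=/usr/bin/:\$PATH"
load_gpu="export PATH=/usr/bin/:\$PATH"   # only needed for the GPU (HiCCUPS) step
```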
Looking forward to your reply! Best wishes!

Ao Qo

Jan 17, 2022, 10:04:36 PM
to 3D Genomics
Hello, everyone,
I still have not managed to run Juicer with SLURM, so I linked the CPU directory into my working directory and ran the CPU version instead. The command was "./scripts/juicer.sh -g Ma6.genome -d /data/01/user157/HIFI/Hic-anchor/Ma6/juicer/work/hic_data -s MboI -p ./restriction_sites/Ma6.genome.chrom.sizes -y ./restriction_sites/Ma6.genome_MboI.txt -z ./references/M.aspalax.6.bp.p_ctg.fasta -D /data/01/user157/HIFI/Hic-anchor/Ma6/juicer/work -t 80 -e early". However, the Hi-C raw reads do not seem to have been split, and I cannot find the debug directory. The attached screenshot shows the contents of the splits directory. Checking the running process, it is in the chimeric read handling step. I am confused: what is chimeric read handling, and is there anything I can refer to about the expected output?
I would appreciate any suggestions you could give me.
Best wishes!

Muhammad Shamim

Jan 27, 2022, 7:00:02 AM
to 3D Genomics

This section:
```
    load_bwa="spack load b...@0.7.17 arch=\`spack arch\`"
    load_awk="spack load ga...@4.1.4 arch=\`spack arch\`"
    load_gpu="spack load cu...@8.0.61 arch=\`spack arch\` && CUDA_VISIBLE_DEVICES=0,1,2,3"
    load_samtools="spack load samt...@1.13 arch=\`spack arch\`"
    call_bwameth="/gpfs0/home/neva/bwa-meth/bwameth.py"
    juiceDir="/gpfs0/juicer2"
    queue="commons"
    queue_time="2880"
    long_queue="long"
    long_queue_time="7200"
```
can absolutely be overwritten or removed for your specific cluster.
Those first 3 are basically examples from 3 different clusters.
But I think we should change the default so it is not Voltron, since that can cause confusion with the spack issue.

If you are able to call bwa directly without needing to use module load or a similar command, you can actually just use:
`load_bwa=""`
If bwa is not available on the compute nodes, then you could certainly edit it the way you've described above.
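For example, a minimal version of that section for your cluster could look something like this (the queue names and time limits below are placeholders based on your command, not recommended values):
```
# If bwa/awk/samtools are already on PATH on the compute nodes,
# the load commands can simply be emptied:
load_bwa=""
load_awk=""
load_gpu=""
load_samtools=""

# Otherwise, prepend a user-local install directory instead of module/spack, e.g.:
# load_bwa="export PATH=/data/01/user157/software/bwa:\$PATH"

# Cluster-specific queues; sbatch interprets a bare number as minutes:
queue="pNormal"
queue_time="1440"
long_queue="pNormal"
long_queue_time="2880"
```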

Roughly how much sequencing data are you processing? If it is in the hundreds of millions or billions of reads, the SLURM flavor is likely the most appropriate (assuming you are working on a SLURM cluster). Also, if this is a SLURM cluster, IT may block attempts to launch a big job on the headnode with the CPU version. Can you confirm that this is indeed a SLURM cluster?

Ao Qo

Feb 10, 2022, 8:08:41 AM
to 3D Genomics
Hello,
I am sorry for the delayed reply.
1. I have about 750 GB of paired-end Hi-C reads and a 3.1 Gb genome assembly. When I ran the CPU scripts on the headnode, the task was killed, so I submitted the CPU run through SLURM:
```
#!/bin/sh
#SBATCH -c 40 --mem 40G
#SBATCH --partition=pNormal
#SBATCH --qos=normal
#SBATCH --get-user-env
#SBATCH -o Ma6.juicer.sh.job2022117115111/Ma6.juicer.sh.split.sh.1.sl.out
#SBATCH -e Ma6.juicer.sh.job2022117115111/Ma6.juicer.sh.split.sh.1.sl.err
#SBATCH -D /data/01/user157/fenshu/HIFI/Hic-anchor/Ma6/juicer/work
./scripts/juicer.sh -g Ma6.genome -d /data/01/user157/fenshu/HIFI/Hic-anchor/Ma6/juicer/work/hic_data -s MboI -p restriction_sites/Ma6.genome.chrom.sizes -y restriction_sites/Ma6.genome_MboI.txt -z references/M.aspalax.6.bp.p_ctg.fasta -D /data/01/user157/fenshu/HIFI/Hic-anchor/Ma6/juicer/work -t 80
```
Although I set the thread count to 80, most steps run single-threaded. The paths to bwa, samtools, and java have been added to my .zshrc and are available on the current node, so maybe I can remove all the sections about "load bwa", "load java", and so on, and instead define everything in the SLURM script like this (see also the sketch after the script below):
```
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 8
#SBATCH -t 5-5:00:00
#SBATCH --mem-per-cpu=2G
#SBATCH --partition=pNormal
#SBATCH --qos=normal
#SBATCH --get-user-env
#SBATCH --job-name=Ma6.juicer
#SBATCH --mail-type=end
#SBATCH --export=ALL
#SBATCH --array=1-10
#SBATCH --output=Ma6.juicer.out
#SBATCH --error=Ma6.juicer.err
./scripts/juicer.sh -g Ma6.genome -d /data/01/user157/fenshu/HIFI/Hic-anchor/Ma6/juicer/work/hic_data -s MboI -p restriction_sites/Ma6.genome.chrom.sizes -y restriction_sites/Ma6.genome_MboI.txt -z references/M.aspalax.6.bp.p_ctg.fasta -D /data/01/user157/fenshu/HIFI/Hic-anchor/Ma6/juicer/work -S final -t 80
```
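For example (this is just my guess at the intended approach; the tool paths are the ones from my .zshrc):
```
#!/bin/bash
#SBATCH -c 80 --mem 90G         # request as many CPUs as the -t value given to juicer.sh
#SBATCH --partition=pNormal
#SBATCH --qos=normal
#SBATCH -D /data/01/user157/fenshu/HIFI/Hic-anchor/Ma6/juicer/work

# Put bwa/samtools/java on PATH here instead of editing the load_* lines in juicer.sh:
export PATH=/data/01/user157/software/bwa:/usr/bin:$PATH

./scripts/juicer.sh -g Ma6.genome \
    -d /data/01/user157/fenshu/HIFI/Hic-anchor/Ma6/juicer/work/hic_data \
    -s MboI \
    -p restriction_sites/Ma6.genome.chrom.sizes \
    -y restriction_sites/Ma6.genome_MboI.txt \
    -z references/M.aspalax.6.bp.p_ctg.fasta \
    -D /data/01/user157/fenshu/HIFI/Hic-anchor/Ma6/juicer/work \
    -t 80
```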

2. Also, when the CPU run finished, the aligned folder contained these results: header, inter_30.txt, inter.txt, merged1.txt, merged30.txt, and merged_dedup.bam; but it lacked collisions.txt, dups.txt, opt_dups.txt, merged_sort, and merged_nodups.txt, which are needed to run 3d-dna. Since I only need merged_nodups.txt for 3d-dna, this was my command after the run crashed due to running out of memory:
```
#!/bin/sh
#SBATCH -c 40 --mem 90G
#SBATCH --partition=pNormal
#SBATCH --qos=normal
#SBATCH --get-user-env
#SBATCH -o Ma6.juicer.sh.job2022121173823/Ma6.juicer.sh.split.sh.1.sl.out
#SBATCH -e Ma6.juicer.sh.job2022121173823/Ma6.juicer.sh.split.sh.1.sl.err
#SBATCH -D /data/01/user157/fenshu/HIFI/Hic-anchor/Ma6/juicer/work
./scripts/juicer.sh -g Ma6.genome -d /data/01/user157/fenshu/HIFI/Hic-anchor/Ma6/juicer/work/hic_data -s MboI -p restriction_sites/Ma6.genome.chrom.sizes -y restriction_sites/Ma6.genome_MboI.txt -z references/M.aspalax.6.bp.p_ctg.fasta -D /data/01/user157/fenshu/HIFI/Hic-anchor/Ma6/juicer/work -S dedup -t 80 -e early
```
After checking juicer.sh in the CPU directory, I found that the juicer.sh obtained via git clone https://github.com/aidenlab/juicer.git differs from the one in the 1.6 release (wget https://github.com/aidenlab/juicer/archive/refs/tags/1.6.tar.gz). I installed Juicer with git clone, and that juicer.sh contains no commands that generate collisions.txt, dups.txt, opt_dups.txt, merged_sort, or merged_nodups.txt, whereas the 1.6 juicer.sh does contain the commands that generate the files I lack. In the end, I generated merged_nodups.txt with: samtools view -@ 80 -O SAM -F 1024 hic_data/aligned/merged_dedup.bam | awk -v mnd=1 -f scripts/common/sam_to_pre.awk > hic_data/aligned/merged_nodups.txt.
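For readability, here is the same conversion with the flags annotated:
```
# Convert the deduplicated BAM into the merged_nodups.txt format that 3d-dna expects:
#   -F 1024  excludes reads flagged as duplicates
#   -@ 80    uses 80 threads for BAM decompression
#   mnd=1    tells sam_to_pre.awk to emit merged_nodups-style output
samtools view -@ 80 -O SAM -F 1024 hic_data/aligned/merged_dedup.bam \
    | awk -v mnd=1 -f scripts/common/sam_to_pre.awk \
    > hic_data/aligned/merged_nodups.txt
```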
I have two questions:
(1) Why could I not generate merged_nodups.txt directly? My log file is:
```
Using restriction_sites/Ma6.genome_MboI.txt as site file
(-:  Mark duplicates done successfully
(-: Pipeline successfully completed (-:
Run cleanup.sh to remove the splits directory
Check /data/01/user157/fenshu/HIFI/Hic-anchor/Ma6/juicer/work/hic_data/aligned for results
```
(2) Why do two different versions of juicer.sh exist, and which one is correct?