job dependency issue on SLURM


Zhifei Luo

May 24, 2018, 4:04:44 PM
to 3D Genomics

Hi there,

I tried to run the SLURM scripts on our HPC cluster and got many errors. After I changed the resource settings in the script, most of them were fixed, except the one below.

I found that line 957 of the script is causing the problem.

956 #Push dedup guard to run only after dedup is complete:
957 scontrol update JobID=$guardjid dependency=afterok:$jid


The error message is:
  Unexpected message received for job XXX (I believe it is the $guardjid)
  sbatch: error: Batch job submission failed: Unable to contact slurm controller (connect failure)

and the rest of the scripts failed due to the job dependency.

Deleting line 957 fixes the error and all the jobs can be submitted, but I worry that the dedup guard might start before the dedup finishes.

The HPC staff suggested putting the job dependency in the initial sbatch command. How can I fix this bug?




Zhifei Luo

May 24, 2018, 11:36:49 PM
to 3D Genomics
The IT people told me that SLURM does not allow you to update your job after submission unless you are an administrator, operator, or account coordinator.
So how can I edit the script to avoid using 'scontrol update'?
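
For reference, the HPC staff's suggestion (declaring the dependency in the initial sbatch command) might look like the sketch below. It assumes the dedup job ID is already known when the guard is submitted, which may not hold for the Juicer script itself. The script names `dedup.sh` and `dedup_guard.sh`, the `submit` helper, and the `DRY_RUN` switch are hypothetical, not taken from Juicer.

```shell
#!/bin/bash
# Sketch only: chain two jobs by declaring the dependency at submission time,
# so that no later "scontrol update" is needed.
# DRY_RUN=1 (the default here) prints the sbatch commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

submit() {
    # Prints the ID of the submitted job on stdout.
    if [ "$DRY_RUN" = "1" ]; then
        echo "sbatch --parsable $*" >&2
        echo "12345"   # placeholder job ID for the dry run
    else
        # --parsable prints "jobid" or "jobid;cluster"; keep only the ID.
        sbatch --parsable "$@" | cut -d';' -f1
    fi
}

# Hypothetical script names; not from Juicer.
jid=$(submit dedup.sh)
guardjid=$(submit --dependency=afterok:"$jid" dedup_guard.sh)
echo "dedup=$jid guard=$guardjid"
```

With `DRY_RUN=0` the same calls would go to the real `sbatch`, and the guard would stay pending until the dedup job exits successfully.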

Neva Durand

May 25, 2018, 6:22:17 AM
to Zhifei Luo, 3D Genomics
For SLURM, the scontrol update is what makes further processing wait for dedupping, and so it is essential to the script. You might need to run Juicer in two steps: first with the flag "-S early", which will stop after dedupping, and then with "-S final" once dedupping is done.
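
One way to sequence the two steps is sketched below, under the assumption that the only jobs you have in the queue are the ones from the "-S early" run. The `wait_for_jobs` helper is hypothetical and not part of Juicer. It only checks that the queue is empty, not that dedupping succeeded, and an `squeue` failure would also look like an empty queue, so check the output before starting the final stage.

```shell
#!/bin/bash
# Sketch only: block until the current user has no jobs left in the queue.
wait_for_jobs() {
    local interval="${1:-60}"          # seconds between polls
    local user="${USER:-$(id -un)}"
    # "squeue -h" suppresses the header, so empty output means no jobs remain.
    while [ -n "$(squeue -h -u "$user")" ]; do
        sleep "$interval"
    done
}

# Intended use (all other Juicer options omitted):
#   juicer.sh -S early
#   wait_for_jobs 120
#   juicer.sh -S final
```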


--
You received this message because you are subscribed to the Google Groups "3D Genomics" group.
To unsubscribe from this group and stop receiving emails from it, send an email to 3d-genomics+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/3d-genomics/7822417f-0858-4e2f-93fa-aed2064738bb%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.



--
Neva Cherniavsky Durand, Ph.D.
Staff Scientist, Aiden Lab

Zhifei Luo

May 29, 2018, 11:10:39 PM
to 3D Genomics
Thanks for the suggestions. I tried '-S early', but the 'scontrol update' commands at lines 948 and 957 made the script fail to run. Is it possible to run '-S early' with the single CPU version and then run '-S final' using SLURM? Do you have any other ideas? Thanks for your time!



Neva Durand

Jun 1, 2018, 5:14:08 PM
to Zhifei Luo, 3D Genomics
This should probably be fixed: a check could be added before lines 951 and 957 to see whether the stage is an early exit.

In your case, you can just comment out those two lines and run with "-S early". Then run with "-S final".
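
The check mentioned above might look something like this sketch. The variable name `stage` and the helper `should_update_guard` are assumptions for illustration; the actual script may track the early-exit stage differently. The `scontrol` call is left as a comment so that the snippet runs without a SLURM controller.

```shell
#!/bin/bash
# Sketch only. "stage" stands in for the value passed to -S.
stage="${stage:-early}"

should_update_guard() {
    # Succeed (return 0) unless this is an early-exit run.
    [ "$1" != "early" ]
}

if should_update_guard "$stage"; then
    # The update from line 957 of the script would go here:
    # scontrol update JobID=$guardjid dependency=afterok:$jid
    msg="updating dedup guard dependency"
else
    msg="early exit: skipping scontrol update"
fi
echo "$msg"
```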
