Stage doesn't complete, but reports success


Marc Hoeppner

Mar 14, 2017, 10:44:47 AM
to bpipe-discuss
Hi, 

Sorry for the cryptic title, but the issue is a bit complex. I am building a GATK pipeline and cannot get Bpipe past the base recalibration step.

It starts the stage and works through the chromosomes, but at around chromosome 19 it simply ends the stage and jumps to the next one, acting as if the base recalibration were complete. Since the output is empty, the next stage then fails.

I put the exact same command-line call generated by the faulty Bpipe stage into a stand-alone cluster script, using the same resource limits, and it finishes fine. I also ran the command in an interactive shell, and again it finishes as expected. Only when I run it through Bpipe do I see the behaviour described above. And it's driving me bonkers...

To reiterate: Bpipe doesn't throw an error and the job does not get killed or anything like that; it just chugs along, prematurely reports success, and moves on to the next stage.

I am using version 0.9.9.3 on a Slurm cluster. The question is: how can I debug this?

Cheers,
Marc

Marc Hoeppner

Mar 15, 2017, 2:34:06 AM
to bpipe-discuss
This is what the problem looks like in the shell:

INFO  16:38:12,377 ProgressMeter -   chr19:8642103   6.5736666E7    58.3 m      53.0 s       85.0%    68.6 m      10.3 m
INFO  16:38:42,378 ProgressMeter -  chr19:20988614   6.6236675E7    58.8 m      53.0 s       85.4%    68.8 m      10.0 m
INFO  16:39:12,379 ProgressMeter -  chr19:42791444   6.6836686E7    59.3 m      53.0 s       86.1%    68.9 m       9.5 m
INFO  16:39:42,380 ProgressMeter -  chr19:56030379   6.7336695E7    59.8 m      53.0 s       86.6%    69.1 m       9.3 m

======================================== Pipeline Succeeded ========================================
16:40:19 MSG:  Finished at Tue Mar 14 16:40:19 CET 2017
16:40:20 MSG:  Output is G00077.L2.realigned.recal_data.grp

And the result file is, of course, empty. 

Simon Sadedin

Mar 18, 2017, 8:43:55 AM
to bpipe-discuss on behalf of Marc Hoeppner
That's very strange behaviour indeed. 

Coincidentally, I have just committed a fix for a Slurm problem where you would see similar behaviour to this. The problem is that when Slurm kills a job for any reason (too much memory, over the time limit, etc.), it actually returns exit code 0 for the command back to Bpipe, so Bpipe carries on thinking the command was successful. On most other clusters you get back some kind of failure exit code when a command is killed. So my guess (even though you say you think this is not the case) is that the cluster is in fact killing the job for some reason, and Bpipe treats it as successful.

It might be that your job is getting submitted by Bpipe with different limits from what you think. You can check the .bpipe/commandtmp directory to see what's in the actual job script that was submitted.
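For example (just a sketch; the exact file names under .bpipe/commandtmp vary, so treat them as placeholders), you could compare what Bpipe actually submitted against Slurm's accounting record for the job:

  # inspect the job script Bpipe generated (paths are illustrative)
  ls .bpipe/commandtmp/
  cat .bpipe/commandtmp/<command id>/*.sh

  # ask Slurm what really happened to that job
  sacct -j <slurm job id> --format=JobID,State,ExitCode,Elapsed,Timelimit,MaxRSS,ReqMem

If the State column shows TIMEOUT, CANCELLED or FAILED rather than COMPLETED, the cluster did kill the job, even though the exit code that came back to Bpipe was 0.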

If you build from source with the latest master branch, you should be able to test this theory, as Bpipe now checks the job completion code as well as the command exit status for Slurm commands.
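Getting a build from master should be something like the following (assuming you have git and a JDK on the path; the Gradle task name is from memory, so double-check the build instructions in the repository):

  git clone https://github.com/ssadedin/bpipe.git
  cd bpipe
  ./gradlew dist   # should produce a runnable distribution under build/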

Hope this helps!

Cheers,

Simon


Marc Hoeppner

Mar 20, 2017, 5:23:45 AM
to bpipe-discuss
Hi,
Happy to do that. Any chance you can move the recent feature request for Slurm options into master? I cannot use Slurm without it.

Cheers,
Marc
