slurmstepd: get_exit_code task 0 died by signal


Henrik Hornshøj

Nov 5, 2015, 3:14:54 AM
to Genome AU Cluster help
Hello,

Has anyone seen this error when submitting an R script with Slurm, and does anyone know a fix?
It seems to be random, as it is mostly not reproducible when re-running the scripts.

Thanks,
Henrik

Anders Halager

Nov 5, 2015, 4:10:07 AM
to Genome AU Cluster help
This usually means that the program used too much memory or that it crashed with something like a segmentation fault.
You can run the jobinfo command with the ID of the failed job to see how much time and memory it used.
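
As a side note on the "died by signal" message: Slurm's own accounting tool, sacct, reports an ExitCode field of the form exit_code:signal, where a non-zero second number means the task was terminated by that signal. A minimal Python sketch for decoding such a value follows. The helper name describe_exit_code is made up for illustration, and the sample values are not taken from the failed job.

```python
import signal


def describe_exit_code(exit_code: str) -> str:
    """Explain a Slurm-style 'exit_code:signal' string (hypothetical helper)."""
    code_text, _, signal_text = exit_code.partition(":")
    code = int(code_text)
    signum = int(signal_text) if signal_text else 0
    if signum:
        try:
            # Map the number to its symbolic name on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name}"
    return "exited normally" if code == 0 else f"exited with status {code}"


print(describe_exit_code("0:9"))   # on Linux: killed by SIGKILL
print(describe_exit_code("0:11"))  # on Linux: killed by SIGSEGV
```

A SIGSEGV would point at a crash in the program itself, whereas a SIGKILL or SIGTERM means something outside the program ended it.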

Anders H

Henrik Hornshøj

Nov 6, 2015, 2:51:52 AM
to Genome AU Cluster help
Thanks, checking jobinfo was also my first thought (see the output below).
The only hint I can see is that the state is CANCELLED.
Memory, CPU and walltime all seem fine.
I wonder why the problem appears to be random.

Henrik


Name                : R
User                : heho
Partition           : normal
Nodes               : s04n64
Cores               : 1
State               : CANCELLED
Submit              : 2015-11-05T22:25:30
Start               : 2015-11-05T22:34:20
End                 : 2015-11-05T22:34:57
Reserved walltime   : 06:00:00
Used walltime       : 00:00:37
Used CPU time       : 00:00:27
% User (Computation): 98.45%
% System (I/O)      :  1.55%
Mem reserved        : 8G/node
Max Mem used        : 481.29M (s04n64)
Max Disk Write      : 168.00M (s04n64)
Max Disk Read       : 342.00M (s04n64)
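
To put numbers on "seem fine", the usage figures above can be compared with the reservations. The following Python sketch assumes the jobinfo layout shown in this post. The helper names and the use of binary units (1G = 1024M) are assumptions, not something jobinfo documents here.

```python
import re

# Size multipliers relative to MiB (binary units are an assumption).
UNITS = {"K": 1 / 1024, "M": 1.0, "G": 1024.0, "T": 1024.0 ** 2}


def to_mib(text: str) -> float:
    """Convert a size such as '481.29M (s04n64)' or '8G/node' to MiB."""
    match = re.match(r"\s*([\d.]+)([KMGT])", text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return float(match.group(1)) * UNITS[match.group(2)]


def to_seconds(text: str) -> int:
    """Convert 'HH:MM:SS' to seconds."""
    hours, minutes, seconds = (int(part) for part in text.strip().split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_jobinfo(report: str) -> dict:
    """Split 'Key : Value' lines into a dict, using the first colon only."""
    fields = {}
    for line in report.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


report = """\
Reserved walltime   : 06:00:00
Used walltime       : 00:00:37
Mem reserved        : 8G/node
Max Mem used        : 481.29M (s04n64)
"""

info = parse_jobinfo(report)
mem_fraction = to_mib(info["Max Mem used"]) / to_mib(info["Mem reserved"])
time_fraction = to_seconds(info["Used walltime"]) / to_seconds(info["Reserved walltime"])
print(f"memory:   {mem_fraction:.1%} of reservation")   # 5.9%
print(f"walltime: {time_fraction:.2%} of reservation")  # 0.17%
```

Both figures are far below the reservations, which is consistent with the job not having been stopped for exceeding its limits.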

Anders Halager

Nov 6, 2015, 3:09:35 AM
to Genome AU Cluster help
Turns out it might be a problem in our own code :)

We have some code that makes sure users can ssh into the machines they are running jobs on. Once the last job on a node finishes, we have to kill all of the user's running processes that were started through ssh, as well as leftover screen sessions etc. Very rarely it seems that too much is killed, but it happens so rarely that we have a hard time testing it.

We have a fix that we think works, and it will be rolled out shortly. But since we can't reliably reproduce the problem, we can't really be sure that it solves it.

Henrik Hornshøj

Nov 6, 2015, 3:20:47 AM
to Genome AU Cluster help
Thanks, I thought the problem was in my own script code.
For now I just re-run the batches that failed.
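
Picking out which batches to re-run can be scripted. The sketch below is only an illustration: it assumes job states are available as JobID|State lines, for example from sacct --noheader --parsable2 --format=JobID,State, and the helper name, the choice of states to retry and the sample job IDs are all made up.

```python
# States treated as worth re-running (an assumption; adjust as needed).
RETRY_STATES = ("CANCELLED", "FAILED", "NODE_FAIL")


def jobs_to_rerun(sacct_output: str) -> list[str]:
    """Return IDs of whole jobs whose state suggests a re-run (hypothetical helper)."""
    job_ids = []
    for line in sacct_output.splitlines():
        line = line.strip()
        if not line:
            continue
        job_id, state = line.split("|", 1)
        # Skip job steps such as '1002.batch'; only whole jobs are listed.
        if "." in job_id:
            continue
        # sacct can report 'CANCELLED by <uid>', so match on the prefix.
        if state.startswith(RETRY_STATES):
            job_ids.append(job_id)
    return job_ids


# Invented sample data in the assumed 'JobID|State' layout.
sample = """\
1001|COMPLETED
1001.batch|COMPLETED
1002|CANCELLED by 0
1002.batch|CANCELLED
1003|FAILED
"""
print(jobs_to_rerun(sample))  # ['1002', '1003']
```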

Henrik