Dear Slurm users,
I am looking for a SLURM setting that will kill a job immediately when any subprocess of that job hits an OOM limit. Several posts have touched on this, e.g. https://www.mail-archive.com/slurm...@lists.schedmd.com/msg04091.html and https://www.mail-archive.com/slurm...@lists.schedmd.com/msg04190.html or https://bugs.schedmd.com/show_bug.cgi?id=3216, but I cannot find an answer that works in our setting.
The two options I have found are:
The reason we want this is that we have scripts that execute programs in loops. These programs are slow and memory-intensive. When the first one is killed for OOM, the subsequent iterations also crash. In the current setup, we are wasting days executing loops where every iteration crashes after an hour or so due to OOM.
We are using cgroups (and we want to keep them) with the following config:
CgroupAutomount=yes
ConstrainCores=yes
ConstrainDevices=yes
ConstrainKmemSpace=no
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
MaxSwapPercent=10
TaskAffinity=no
Relevant bits from slurm.conf:
SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
SelectType=select/cons_tres
GresTypes=gpu,mps,bandwidth
Very simple example:
#!/bin/bash
# multalloc.sh – each ./alloc8Gb call runs a very simple C++ program that allocates an 8 GB vector and fills it with random floats
echo one
./alloc8Gb
echo two
./alloc8Gb
echo three
./alloc8Gb
echo done.
This is submitted as follows:
sbatch --mem=1G ./multalloc.sh
The log is:
one
./multalloc.sh: line 4: 231155 Killed ./alloc8Gb
two
./multalloc.sh: line 6: 231181 Killed ./alloc8Gb
three
./multalloc.sh: line 8: 231263 Killed ./alloc8Gb
done.
slurmstepd: error: Detected 3 oom-kill event(s) in StepId=3130111.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
I am expecting an OOM kill of the whole job right before “two” is printed.
Any help appreciated.
Best regards,
Arthur
-------------------------------------------------------------
Dr. Arthur Gilly
Head of Analytics
Institute of Translational Genomics
Helmholtz-Centre Munich (HMGU)
-------------------------------------------------------------
Thank you Loris!
Like many of our jobs, this is an embarrassingly parallel analysis, where we have to strike a compromise between what would be a completely granular array of >100,000 small jobs and some kind of serialisation through loops. So the individual jobs where I noticed this behaviour are actually already part of an array :)
Cheers,
Arthur
-------------------------------------------------------------
Dr. Arthur Gilly
Head of Analytics
Institute of Translational Genomics
Helmholtz-Centre Munich (HMGU)
-------------------------------------------------------------
Any reason *not* to create an array of 100k jobs and let the scheduler just handle things? Current versions of Slurm support arrays of up to 4M jobs, and you can limit the number of jobs running simultaneously with the '%' specifier in your array= sbatch parameter.
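For example, a throttled array could look roughly like this (just a sketch; one_task.sh and process_one_unit are hypothetical placeholders for your per-task script and program):

#!/bin/bash
#SBATCH --mem=1G
#SBATCH --array=1-100000%200    # run at most 200 array tasks at any one time
# one_task.sh – each array task handles exactly one unit of work,
# selected via the SLURM_ARRAY_TASK_ID environment variable
./process_one_unit "${SLURM_ARRAY_TASK_ID}"

and would be submitted simply with 'sbatch ./one_task.sh'.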
From: slurm-users <slurm-use...@lists.schedmd.com> on behalf of Arthur Gilly <arthur...@helmholtz-muenchen.de>
Date: Tuesday, June 8, 2021 at 4:12 AM
To: 'Slurm User Community List' <slurm...@lists.schedmd.com>
Subject: Re: [slurm-users] Kill job when child process gets OOM-killed
I could say that the limit on maximum array size is lower on our cluster, and that we start to see I/O problems very quickly as parallelism scales (which we can limit with %, as you mention). But the actual reason is simpler: as I mentioned, we have an entire collection of scripts that were written for a previous LSF system where the “kill job on OOM” setting was active. What you are suggesting would mean rewriting all these scripts so that each submitted job is granular (executes only one atomic task) and orchestrating all of it using SLURM dependencies etc. This is a huge undertaking, and I’d rather just find this setting, which I’m sure exists.
-------------------------------------------------------------
Dr. Arthur Gilly
Head of Analytics
Institute of Translational Genomics
Helmholtz-Centre Munich (HMGU)
-------------------------------------------------------------
Yep, those are reasons not to create the array of 100k jobs.
From https://www.mail-archive.com/slurm...@lists.schedmd.com/msg04092.html , deeper in the thread from one of your references, there's a mention of using 'set -o errexit' inside the job script together with the '-K' or '--kill-on-bad-exit' parameter, so that a job exits if any of its processes exits with a non-zero error code.
Assuming all your processes exit with code 0 when things are running normally, that could be an option.
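Applied to your earlier multalloc.sh, that would look roughly like this (a sketch only; it relies on the OOM-killed child returning a non-zero exit status, e.g. 137 = 128+SIGKILL, to the calling shell):

#!/bin/bash
# multalloc.sh with errexit: bash aborts the script as soon as any command
# exits non-zero, e.g. after a cgroup OOM kill, so the remaining
# iterations are never started.
set -o errexit
echo one
./alloc8Gb
echo two
./alloc8Gb
echo three
./alloc8Gb
echo done.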
Thanks Michael. 'set -o errexit' is the same as using #!/bin/bash -e as the interpreter line, as far as I’m aware. As I mention in the original post, I would like to avoid that. It involves modifying scripts (although to a lesser extent), and it would end script execution on any runtime error or non-zero exit code, which may not be desirable. But mainly, it can have unintended consequences on script execution (http://mywiki.wooledge.org/BashFAQ/105), and altogether it does not really do what it claims to, potentially causing other hard-to-debug runtime errors. I have officially discouraged our analysts from using it for these reasons, so I would prefer to keep it as a very last-resort solution.
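To illustrate one of the classic pitfalls discussed there (a contrived sketch, not one of our actual scripts): errexit is suppressed while a command runs as part of an if condition, so a failure inside such a function does not stop the script:

#!/bin/bash
set -o errexit
prepare() {
    false                  # fails, but does not abort the script here...
    echo "unexpectedly still running"
}
if prepare; then           # ...because errexit is ignored inside an 'if' test
    echo "prepare reported success"
fi

Both echo lines are printed, even though the failing command was supposed to abort everything.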
Sbatch doesn’t seem to have a -K argument, only srun does, which means I’d have to rework our batch scripts so that every program is launched through srun, which also leads to a significant rewrite… I am starting to think that the feature I am after does not exist! Since several other people have inquired about this in the past, I think it’d be useful to request this as a feature. Is there a place similar to GitHub issues, where users can make these suggestions to SchedMD?
Cheers,
A