[slurm-users] Troubles with cgroups

Hermann Schwärzler

Mar 16, 2023, 10:54:34 AM
to slurm...@lists.schedmd.com
Dear Slurm users,

after opening our new cluster (62 nodes with 250 GB RAM and 64 cores each -
Rocky Linux 8.6 - kernel 4.18.0-372.16.1.el8_6.0.1 - Slurm 22.05) for
"friendly user" test operation about 6 weeks ago, we soon faced
serious problems with nodes suddenly becoming unresponsive (so much
so that only a hard reboot via IPMI gets them back).

We were able to narrow the problem down to one similar to this:
https://github.com/apptainer/singularity/issues/5850
In our case, though, it is not related to Singularity but to cgroups in
general.

We are using cgroups in our Slurm configuration to limit RAM, CPUs and
devices. In the beginning we did *not* limit swap space (we are doing so
now to work around the problem but would like to allow at least some
swap space).
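
For context, the relevant part of a cgroup.conf for this kind of setup
looks roughly like the following (an illustrative sketch, not our exact
configuration):

# cgroup.conf (illustrative sketch, not our exact file)
ConstrainCores=yes          # limit CPUs
ConstrainRAMSpace=yes       # limit RAM
ConstrainDevices=yes        # limit access to devices
# the workaround we are using now; initially swap was *not* constrained:
ConstrainSwapSpace=yes
AllowedSwapSpace=0          # percent of RAM allowed as additional swap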

We are able to reproduce the problem *outside* of Slurm as well, using
the small test program mentioned in the Singularity GitHub issue above
(https://gist.github.com/pja237/b0e9a49be64a20ad1af905305487d41a) with
these steps (for cgroups/v1):

cd /sys/fs/cgroup/memory
mkdir test                                           # create a new memory cgroup
cd test
echo $((5*1024*1024*1024)) > memory.limit_in_bytes   # 5 GiB memory limit
echo $$ > cgroup.procs                               # move the current shell into the cgroup
/path/to/mempoc 2 10                                 # run the test program inside it

After about 10 to 30 minutes the problem occurs.

We tried switching to cgroups/v2, which does solve the problem for the
manual case outside Slurm:

cd /sys/fs/cgroup
mkdir test
cd test
echo "+memory" > cgroup.subtree_control          # enable the memory controller for child cgroups
mkdir test2
echo $((5*1024*1024*1024)) > test2/memory.high   # 5 GiB "high" (throttling) limit
echo $$ > test2/cgroup.procs                     # move the current shell into test2
/path/to/mempoc 2 10

Now it runs for days and weeks without any issues!

But when we run the same thing in Slurm (with cgroups/v2 configured to
*not limit* swapping) by using

sbatch --mem=5G --cpus-per-task=10 \
--wrap "/path/to/mempoc 2 10"

the nodes still become unusable after some time (1 to 5 hours) with the
usual symptoms.
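
For reference, the limits Slurm actually writes for such a job can be
inspected on the compute node while the job is running, roughly like
this (the exact cgroup path is an assumption and depends on the local
cgroup/v2 and slurmd setup):

JOBID=123456   # hypothetical job id
cd /sys/fs/cgroup/system.slice/slurmstepd.scope/job_${JOBID}
grep . memory.high memory.max memory.swap.max   # print file:value pairs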

Did any of you face similar issues?
Are we missing something?
Is it unreasonable to think our systems should stay stable even when
there is cgroup-based swapping?

Kind regards,
Hermann


Jason Simms

Mar 17, 2023, 4:35:30 PM
to Slurm User Community List
Hello,

This isn't precisely related, but I can say that we were having strange issues with system load spiking to the point that the nodes became unresponsive and likewise needed a hard reboot. After several tests and working with our vendor, the problems ceased on nodes where we entirely disabled swap. You may have an absolutely valid need for swap, or some configurations may in fact rely on it for whatever reason, but for now we've chosen to disable swap on all nodes. It's interesting, however, because I never really identified the culprit, and it may be related to cgroups somehow; regardless, disabling swap appears to be working for us with no immediate consequences.

Warmest regards,
Jason
--
Jason L. Simms, Ph.D., M.P.H.
Manager of Research Computing
Swarthmore College
Information Technology Services
Schedule a meeting: https://calendly.com/jlsimms

Hermann Schwärzler

Mar 21, 2023, 12:18:40 PM
to slurm...@lists.schedmd.com
Hi Jason,

thank you for your reply.
From what I can tell, your problem *is* the same as ours. BTW: we were
already talking about disabling swap on our nodes as a last resort. :-)

In the meantime we made some new findings: we can trigger the error
when (with cgroups/v2) we set memory.high and memory.max to the same
value (as Slurm does in our configuration). This error occurs with
Rocky Linux 8.x as well as with 9.x.
It does *not* get triggered when memory.high is set to the limit and
memory.max is set to some higher value or to "max".
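
To reproduce this outside Slurm, the cgroups/v2 test from my first mail
can be varied roughly like this (a sketch, reusing the test/test2
hierarchy created there):

cd /sys/fs/cgroup/test
# triggers the hang: memory.high and memory.max set to the same value
echo $((5*1024*1024*1024)) > test2/memory.high
echo $((5*1024*1024*1024)) > test2/memory.max
# does NOT trigger it: leave memory.max at "max" (or any value above
# memory.high) instead of the line above
echo $$ > test2/cgroup.procs
/path/to/mempoc 2 10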

Our plan now is to check whether this also happens in RHEL. If so, we
will open a support case.
And if time permits, we will check whether it can be triggered with a
vanilla kernel.

Regards,
Hermann


Jason Simms

Mar 21, 2023, 12:42:44 PM
to Slurm User Community List
Hello Hermann,

Thanks for following up about this. What you say makes sense: at Lafayette, we didn't experience the issue until upgrading to a Slurm version that supported cgroups/v2, and here at Swarthmore, we are still on a version of Slurm that doesn't, and we don't have the issue (both Rocky 8). At this point I'm still intending to disable swap entirely during our next maintenance cycle, but if we can identify the issue and a reasonable solution, perhaps we won't have to.

Please keep us posted on how your testing goes! It would be great to confirm that it's a reproducible bug and not something related to our hardware or configuration.

Warmest regards,
Jason
--
Jason L. Simms, Ph.D., M.P.H.
Manager of Research Computing

Hermann Schwärzler

May 17, 2023, 9:00:33 AM
to slurm...@lists.schedmd.com
Hi everybody,

I would like to give you a quick update on this problem (systems
hanging when swapping occurs due to cgroup memory limits):

We had opened a case with Red Hat's customer support. After some back
and forth they were able to reproduce the problem. Last week they told
us to upgrade to version 9.2 (kernel 5.14.0-284.11.1.el9_2.x86_64),
which was released a few days ago.
Doing so *did indeed fix the problem*!
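
A quick way to check whether a node is already running the fixed
kernel:

uname -r    # should report 5.14.0-284.11.1.el9_2.x86_64 or newer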

The underlying bug is one in mainline kernels; its fix (from March
2022) is discussed and described here:
https://lore.kernel.org/all/20220221111749.1...@gmail.com/

We asked support whether there are plans to backport this fix to the
8.x versions of the system. The answer: a bug has been raised and it is
expected to be fixed in version 8.9 (release date not yet set).

Regards,
Hermann
