[slurm-users] swap size


A

Sep 21, 2018, 5:33:04 PM
to Slurm User Community List
I have a single-node Slurm config on my workstation (18 cores, 256 GB RAM, 40 TB disk space). I recently extended the array to its current size and am reconfiguring my LVM logical volumes.

I'm curious about people's thoughts on swap size for a node. Red Hat these days recommends up to 20% of RAM for swap, but no less than 4 GB.

But according to the Slurm FAQ:
"Suspending and resuming a job makes use of the SIGSTOP and SIGCONT signals respectively, so swap and disk space should be sufficient to accommodate all jobs allocated to a node, either running or suspended."

So I'm wondering if 20% is enough, or whether it should scale with the number of jobs I might be running at any one time. E.g. if I'm running 10 jobs that each use 20 GB of RAM and I suspend them all, do I need 200 GB of swap?
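For context, carving out the swap LV would just be something along these lines (the volume group name and size are placeholders - the size is exactly what I'm trying to decide):

-----
$ sudo lvcreate -L 50G -n swap vg0     # size is the open question
$ sudo mkswap /dev/vg0/swap
$ sudo swapon /dev/vg0/swap
-----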

any thoughts?

-ashton

John Hearns

Sep 22, 2018, 1:05:43 AM
to Slurm User Community List
Ashton, on a compute node with 256 GBytes of RAM I would not
configure any swap at all. None.
I managed an SGI UV1 machine at an F1 team which had 1 TByte of RAM -
and no swap.
Also, our ICE clusters were diskless - SGI very smartly configured swap
over iSCSI - but we disabled this, the reason being that if one node
in a job starts swapping, the likelihood is that all the nodes are
swapping, and things turn to treacle from there.
Also, as another issue, if you have lots of RAM you need to look at
the vm tunings for dirty ratio, background ratio and writeback
centisecs. Linux will aggressively cache data which is written to
disk - you can get a situation where your processes THINK data is
written to disk but it is still cached, and then what happens if there
is a power loss? So get those caches flushed often.
https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
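As a rough sketch of the sort of thing I mean (the numbers are just a starting point, not a recommendation - put them in /etc/sysctl.d/ if you want them to persist):

-----
$ sudo sysctl vm.dirty_background_ratio=5       # start background writeback sooner
$ sudo sysctl vm.dirty_ratio=10                 # throttle writers earlier
$ sudo sysctl vm.dirty_writeback_centisecs=100  # wake the flusher every second
-----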

Oh, and my other tip. In the past vm.min_free_kbytes was ridiculously
small on default Linux systems. I call this the 'wriggle room' when a
system is short on RAM. Think of it like those square sliding-letter
puzzles - min_free_kbytes is the empty square which permits the letter
tiles to move.
So look at your min_free_kbytes and increase it (if I'm not wrong, on
RHEL 7 and CentOS 7 systems it is a reasonable value already).
https://bbs.archlinux.org/viewtopic.php?id=184655
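Something like this to check and bump it (the 1 GB figure is just an example):

-----
$ cat /proc/sys/vm/min_free_kbytes           # current value
$ sudo sysctl vm.min_free_kbytes=1048576     # reserve ~1 GB of wriggle room
-----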

Oh, and it is good to keep a terminal open with 'watch cat
/proc/meminfo'. I have spent many a happy hour staring at that when
looking at NFS performance etc.

Back to your specific case. My point is that for HPC work you should
never go into swap (with a normally running process, i.e. no job
pre-emption). I find the 20 percent rule is out of date. Yes, you
should probably have some swap on a workstation. And yes, disk space
is cheap these days.


However, you do talk about job pre-emption and suspending/resuming
jobs. I have never actually seen that being used in production.
At this point I would be grateful for some education from the choir -
is this commonly used and am I just hopelessly out of date?
Honestly, anywhere I have managed systems, lower priority jobs are
either allowed to finish, or in the case of F1 we checkpointed and
killed low priority jobs manually if there was a super high priority
job to run.

A

Sep 22, 2018, 2:03:01 AM
to Slurm User Community List
Hi John! Thanks for the reply, lots to think about.

In terms of suspending/resuming, my situation might be a bit different from other people's. As I mentioned, this is an install on a single-node workstation. This is my daily office machine. I run a lot of Python processing scripts that have low CPU need but lots of iterations. I found it easier to manage these in Slurm, as opposed to writing MPI/parallel processing routines in Python directly.

Given this, sometimes I might submit a Slurm array with 10K jobs that might take a week to run, but I still sometimes need to do work during the day that requires more CPU power. In those cases I suspend the background array, crank through whatever I need to do and then resume in the evening when I go home. Sometimes I can wait for jobs to finish, sometimes I have to break in the middle of running jobs.
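Concretely, it's just something like this (job ID and script name are placeholders):

-----
$ sbatch --array=1-10000 process_chunk.sh   # the big background array
$ scontrol suspend 12345                    # morning: stop all running tasks of the array
$ scontrol resume 12345                     # evening: let it carry on overnight
-----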

Raymond Wan

Sep 22, 2018, 2:19:54 AM
to Slurm User Community List
Hi Ashton,

On Sat, Sep 22, 2018 at 5:34 AM A <andre...@gmail.com> wrote:
> So I'm wondering if 20% is enough, or whether it should scale by the number of single jobs I might be running at any one time. E.g. if I'm running 10 jobs that all use 20 gb of ram, and I suspend, should I need 200 gb of swap?


Perhaps I'm a bit clueless here, but maybe someone can correct me if I'm wrong.

I don't think swap space or a swap file is used like that. If you
have 256 GB of memory and a 256 GB swap file (I don't suggest this
size...it just makes my math easier :-) ), then from the point of view
of the OS, it will appear there is 512 GB of memory. So, this is
memory that is used while jobs are running...for reading in data, etc.

SLURM's ability to suspend jobs must be storing the state in a
location outside of this 512 GB. So, you're not helping this by
allocating more swap.

What you are doing is perhaps allowing more jobs to run concurrently,
but I would caution against allocating more swap space. After all,
disk read/write is much slower than memory. If you can run 10 jobs
within 256 GB of memory but 20 jobs within 512 GB of (memory + swap
space), I think you should do some kind of test to see if it would be
faster to just let 10 jobs run. Since disk I/O is slower, I doubt
you're going to get double the throughput.

Personally, I still create swap space, but I agree with John that a
server with 256 GB of memory shouldn't need any swap at all. With
what I run, if it uses more than the amount of memory that I have, I
tend to stop it and find another computer to run it on. If there isn't
one, I need to admit I can't do it. Because once it exceeds the
amount of main memory, it will start thrashing and, thus, take a lot
longer to run (a week or more instead of a day)...

On the other hand, we do have servers that double as desktops during
the day. An alternative for you to consider is to allocate only 200
GB of memory to Slurm, for example, leaving 56 GB for your own use.
Yes, this means that, at night, 56 GB of RAM is wasted, but during the
day your jobs can keep running while you work. Of course, you should
set aside an amount that is enough for you...56 GB was chosen to make
my math easier as well. :-)
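As a sketch of what I mean in slurm.conf (node name and numbers are placeholders, and memory has to be a consumable resource for the limit to be enforced):

-----
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
# Advertise only ~200 GB to Slurm; the remaining ~56 GB stays free for the desktop
NodeName=ws01 CPUs=18 RealMemory=204800
-----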

If something I said here isn't quite correct, I'm happy to have
someone correct me...

Ray

John Hearns

Sep 22, 2018, 9:05:43 AM
to Slurm User Community List
I would say that, yes, you have a good workflow here with Slurm.
As another aside - is anyone working with suspending and resuming containers?
I see on the Singularity site that suspend/resume is on the roadmap (I
am not talking about checkpointing here).

Also it is worth saying that these days one would be swapping to SSDs,
or even better NVRAM devices, so the penalties for swapping will be
less.
Warming to my theme, what we should be looking at for large-memory
machines is tiered memory. Fast DRAM for the data which is actively
being worked on, then slower tiers of cheaper memory.
Diablo had implemented this; I believe they are no longer active. Also
there is Optane - which seems to have gone a bit quiet.
But having read up on Diablo, the drivers for tiered memory are in the
Linux kernel.

Enough of my ramblings!
Maybe one day you will have a system with TBytes of memory, and only
256 gig of real fast DRAM.

Renfro, Michael

Sep 22, 2018, 10:36:24 AM
to Slurm User Community List
If your workflows are primarily CPU-bound rather than memory-bound, and since you’re the only user, you could ensure all your Slurm scripts ‘nice’ their Python commands, or use the -n flag for slurmd and the PropagatePrioProcess configuration parameter. Both of these are in the thread at https://lists.schedmd.com/pipermail/slurm-users/2018-September/001926.html
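A minimal sketch of the 'nice' route (script and file names are just examples; the slurmd -n / PropagatePrioProcess route is described in the linked thread):

-----
#!/bin/bash
#SBATCH --job-name=py-array
#SBATCH --array=1-10000

# Run the Python work at the lowest CPU priority so interactive use
# of the workstation stays responsive.
nice -n 19 python process.py "${SLURM_ARRAY_TASK_ID}"
-----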

--
Mike Renfro / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

Chris Samuel

Sep 22, 2018, 9:34:19 PM
to slurm...@lists.schedmd.com
On Saturday, 22 September 2018 4:19:09 PM AEST Raymond Wan wrote:

> SLURM's ability to suspend jobs must be storing the state in a
> location outside of this 512 GB. So, you're not helping this by
> allocating more swap.

I don't believe that's the case. My understanding is that in this mode it's
just sending processes SIGSTOP and then launching the incoming job, so you
should really have enough swap for the previous job to be swapped out to,
in order to free up RAM for the incoming job.

--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC




Raymond Wan

Sep 23, 2018, 10:47:32 AM
to Slurm User Community List

Hi Chris,


On Sunday, September 23, 2018 09:34 AM, Chris Samuel wrote:
> On Saturday, 22 September 2018 4:19:09 PM AEST Raymond Wan wrote:
>
>> SLURM's ability to suspend jobs must be storing the state in a
>> location outside of this 512 GB. So, you're not helping this by
>> allocating more swap.
>
> I don't believe that's the case. My understanding is that in this mode it's
> just sending processes SIGSTOP and then launching the incoming job so you
> should really have enough swap for the previous job to get swapped out to in
> order to free up RAM for the incoming job.


Hmmmmmm, I'm way out of my comfort zone but I am curious
about what happens. Unfortunately, I don't think I'm able
to read kernel code, but someone here
(https://stackoverflow.com/questions/31946854/how-does-sigstop-work-in-linux-kernel)
seems to suggest that SIGSTOP and SIGCONT move a process
between the runnable and waiting queues.

I'm not sure if I did the correct test, but I wrote a C
program that allocates a lot of memory:

-----
#include <stdlib.h>

#define memsize 160000000

int main () {
  char *foo = NULL;

  /* Allocate ~160 MB. */
  foo = (char *) malloc (sizeof (char) * memsize);
  if (foo == NULL) {
    return 1;
  }

  /* Touch every byte so the pages are actually faulted in and
     show up as resident memory. */
  for (int i = 0; i < memsize; i++) {
    foo[i] = 0;
  }

  /* Spin forever so the process stays around for inspection. */
  do {
  } while (1);
}
-----

Then, I ran it and sent a SIGSTOP to it. According to htop
(I don't know if it's correct), it seems to still be
occupying memory, just not using any CPU cycles.
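For anyone who wants to repeat it, roughly what I did (file and program names are just examples):

-----
$ gcc -std=c99 -o memtest memtest.c
$ ./memtest &
$ kill -STOP %1                               # send SIGSTOP
$ grep VmRSS /proc/$(pgrep memtest)/status    # resident memory is still allocated
$ kill -CONT %1                               # resume it again
-----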

Perhaps I've done something wrong? I did read elsewhere
that how SIGSTOP is treated can vary from system to
system... I happen to be on an Ubuntu system.

Ray



A

Sep 23, 2018, 12:46:03 PM
to Slurm User Community List
Ray 

I'm also on Ubuntu. I'll try the same test, but do it with and without swap on (e.g. by running the swapoff and swapon commands first). To complicate things, I also don't know if the swappiness level makes a difference.
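Roughly, something like this (reusing the little test program from above; the swappiness value is just an example):

-----
$ sudo swapoff -a                       # first pass: no swap at all
$ ./memtest &
$ kill -STOP %1                         # then watch htop / /proc/meminfo
$ sudo swapon -a                        # second pass: swap back on
$ cat /proc/sys/vm/swappiness           # current swappiness
$ sudo sysctl vm.swappiness=10          # temporarily change it and repeat
-----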

Thanks
Ashton

Christopher Samuel

Sep 23, 2018, 7:35:32 PM
to slurm...@lists.schedmd.com
On 24/09/18 00:46, Raymond Wan wrote:

> Hmmmmmm, I'm way out of my comfort zone but I am curious about what
> happens.  Unfortunately, I don't think I'm able to read kernel code, but
> someone here
> (https://stackoverflow.com/questions/31946854/how-does-sigstop-work-in-linux-kernel)
> seems to suggest that SIGSTOP and SIGCONT moves a process
> between the runnable and waiting queues.

SIGSTOP is a non-catchable signal that immediately stops a process from
running, and so it will sit there until either resumed, killed or the
system is rebooted. :-)

It's like doing ^Z in the shell (which generates SIGTSTP) but isn't
catchable via signal handlers, so you can't do anything about it (same
as SIGKILL).

Regarding memory, yes, its memory is still used until the process
either resumes and releases it or is killed. This is why, if you want
to do preemption in this mode, you'll want swap so that the kernel has
somewhere to page out the memory it's using, to make room for the
incoming process(es).
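If you ever want Slurm to do the suspending for you rather than by hand, a rough slurm.conf sketch would look something like this (partition names, node name and priorities are just placeholders):

-----
# Preempt lower-priority jobs by suspending them (gang scheduling)
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG

PartitionName=batch  Nodes=ws01 PriorityTier=1  Default=YES
PartitionName=hipri  Nodes=ws01 PriorityTier=10
-----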

Hope that helps!

All the best,
Chris

Raymond Wan

Sep 23, 2018, 10:54:50 PM
to Slurm User Community List
Hi Chris,
Ah!!! Yes, this clears things up for me -- thank you! Somehow, I
thought what you meant was that SLURM suspends a job and "immediately"
its state is saved. Then I guessed if SLURM could do that, it ought
to be outside of the main memory + swap space managed by the OS.

But now I see what you mean. It's just doing it within the signal
communication provided by the OS.

The job gets stopped but it remains in main memory. That is, it
doesn't immediately shift to swap space. But having more swap space
gives the suspended job somewhere to be paged out to, so that a
currently running job that needs the memory can run. Of course, if an
HPC system has enough main memory to support all suspended jobs and
any other programs that need to be running while the others are
suspended, then I also see why swap space isn't necessary.

Thank you for taking the time to clarify things!

Ray
