I think the situation is likely to be a little different. Consider a Fortran program that statically or dynamically declares large arrays. Those declarations define the virtual memory size: in effect a statement of the maximum amount of memory the program might use if it filled every array. That much real memory plus swap must be available for the program to run, since it might use all of it. Speaking loosely, Linux has a lazy memory allocation policy, so physical memory may not actually be allocated until it is touched. If the program happens to read a smaller dataset and the arrays are never filled, the resident set size (RSS) can be significantly smaller than the virtual memory size. Memory that has been swapped out doesn't count towards RSS either, so it might be smaller still.

Effectively, RSS is a process's actual footprint in RAM. It changes over the life of the process/job, and Slurm tracks the maximum (MaxRSS). I'd expect MaxRSS to be the maximum of the sum of the RSS of the known processes, sampled periodically through the job, but I'm guessing. That should apply reasonably to parallel jobs provided the sum spans nodes (it wouldn't be the first batch system to only effectively account for the first allocated node).

The Linux memory tracking/accounting system has gotchas, since shared memory (say, for library code) has to be accounted for somewhere, but in HPC we can reasonably assume that memory use is dominated by each process's unique computational working set data, so MaxRSS is a good estimate of how much RAM is needed to run a given job.
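If you want to see the lazy allocation for yourself, here is a minimal sketch of the kind of Fortran program I mean. The array size, the 25% fill and the names are just illustrative choices of mine, and it reads VmSize/VmRSS from /proc/self/status, so it is Linux-specific and needs a Fortran 2008 compiler (e.g. gfortran):

  program lazy_alloc_demo
    implicit none
    ! ~4 GB of real(8): this counts towards the virtual size as soon as it is allocated
    integer(8), parameter :: n = 500000000_8
    real(8), allocatable :: a(:)

    allocate(a(n))            ! virtual size jumps by ~4 GB; almost no RAM is touched yet
    call show_mem('after allocate')
    a(1:n/4) = 1.0d0          ! touch a quarter of the pages; RSS grows by roughly 1 GB
    call show_mem('after 25% fill')
    print *, sum(a(1:10))     ! stop the compiler optimising the array away

  contains

    ! Print this process's VmSize (virtual size) and VmRSS (resident set size).
    subroutine show_mem(label)
      character(*), intent(in) :: label
      character(len=256) :: line
      integer :: u, ios
      open(newunit=u, file='/proc/self/status', action='read', iostat=ios)
      if (ios /= 0) return
      do
        read(u, '(a)', iostat=ios) line
        if (ios /= 0) exit
        if (line(1:6) == 'VmSize' .or. line(1:6) == 'VmRSS:') &
          print '(a,2x,a)', trim(label), trim(line)
      end do
      close(u)
    end subroutine show_mem

  end program lazy_alloc_demo

On a typical Linux box VmSize is about the same at both points, while VmRSS only grows noticeably after the assignment: that gap is exactly the difference between virtual memory size and resident set size described above, and MaxRSS is Slurm's record of the latter.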
Gareth