It sounds like you're confusing job steps and tasks. For an MPI
program, tasks and MPI ranks are the same thing. A Slurm job can
have multiple steps. A single job step could have only 1 task,
while another step in the same job can use 1,000 tasks. When
looking at the amount of memory for a job, the important number
is the largest value of MaxRSS across all the job steps. Why is
this important? Because if you don't request at least this much
with your --mem specification, your job may fail.
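For example, if the largest step MaxRSS turned out to be around
24 GB, you'd pad that a little and request it at submission time.
A sketch (the script name and task count here are made up for
illustration, not from any real job):

$ sbatch --ntasks=16 --mem=26G run_mcnp.sh

Keep in mind that --mem is a per-node limit, so if several large
tasks land on the same node you need to cover their combined
usage on that node, not just a single task's.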
Based on your definition of AveRSS (I didn't go back and check
the documentation myself), it sounds like you're doing
unnecessary math, since I'm sure Slurm sums up the individual
per-task maximum RSS values to get MaxRSS, and then divides that
by the number of tasks to get AveRSS.
What you want is the MaxRSS for the job step with the largest
value of MaxRSS. For example, here's a parallel job I ran earlier
today:
$ sacct -u pbisbal -o jobid,jobname,MaxRSS,AveRSS
       JobID    JobName     MaxRSS     AveRSS
------------ ---------- ---------- ----------
1100800       mcnp_test
1100800.bat+      batch  20999632K  20999632K
1100800.ext+     extern      1060K       964K
1100800.0         orted  24014384K 9238477482
The real "memory" for this entire job would be 24014384K
Prentice
This is incorrect. MaxRSS is the amount of RAM used by the single task that used the most RAM. That is why there are also MaxRSSNode and MaxRSSTask values: MaxRSSNode is the node where that task ran, and MaxRSSTask is the task ID of that task.
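To see which task and node that was, those fields can be added
to the sacct format list (reusing the job ID from the example
above):

$ sacct -j 1100800 -o jobid,maxrss,maxrssnode,maxrsstask,ntasks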
Thanks for the correction. That's what I originally thought, and then I read the definition he provided, which is exactly the same as in the documentation, and completely misinterpreted it. When I look at the sacct documentation and see that same definition in the context of all the other MaxRSS-related fields, it's clear I screwed up. Sorry!
SchedMD should reword that so that even out of context it's
clear what it represents.
When I read "Maximum resident set size of all tasks in job" I automatically thought "Maximum of the *sum* of the RSSes of each task.
Prentice