[slurm-dev] job step memory limits in proctrack plugin


matthieu hautreux

Nov 24, 2009, 3:31:21 PM
to slurm-dev
Hi,

I'm currently working on a proctrack/cgroup plugin with slurm-2.1.0-0.pre7. I would like to get the memory limits for both the job and the job step so that I can set the corresponding constraints in the matching cgroup directories "/dev/cgroup/slurm/job_%jobid" and "/dev/cgroup/slurm/job_%jobid/step_%stepid" when required by the configuration (the paths are just examples).
As those values now seem to be distinct, having two separate fields in slurmd_job_t could be interesting. For example:

typedef struct slurmd_job {
...
    uint32_t       job_mem;  /* MB of memory reserved for the job       */
    uint32_t       step_mem;  /* MB of memory reserved for the job step      */
...
} slurmd_job_t;


Does this make sense to you?

Regards,
Matthieu

Don Lipari

Dec 4, 2009, 6:37:36 PM
to slur...@lists.llnl.gov
Matthieu,

SLURM currently supports two alternative memory constraints, per node
and per CPU, but never both together. The guideline is that per-node
memory limits should be used when SelectType is select/linear, and per-CPU
limits should accompany select/cons_res.

Throughout the code, a single variable (usually job_min_memory or
job_mem) is used to convey the memory constraint, and the MEM_PER_CPU
flag is OR'd into the value when the variable conveys a memory-per-CPU limit.
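
As a rough illustration of that encoding (a minimal sketch only; the MEM_PER_CPU value below is illustrative, the real definition lives in slurm.h):

#include <stdint.h>
#include <stdio.h>

#define MEM_PER_CPU 0x80000000   /* illustrative; use the definition from slurm.h */

/* Return the effective per-node memory limit in MB for ncpus CPUs. */
static uint32_t effective_mem_mb(uint32_t job_mem, uint32_t ncpus)
{
    if (job_mem & MEM_PER_CPU)
        return (job_mem & ~MEM_PER_CPU) * ncpus;  /* per-CPU limit */
    return job_mem;                               /* per-node limit */
}

int main(void)
{
    uint32_t mem = 600 | MEM_PER_CPU;     /* 600 MB per CPU */
    printf("limit on 2 CPUs: %u MB\n", (unsigned) effective_mem_mb(mem, 2));
    return 0;
}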

I'm unclear as to how you would assign a new step_mem variable in the
slurmd_job_t structure. I would hope that you are not thinking of
adding a new step-memory argument to salloc/sbatch/srun. If instead
you derive the step_mem value based on the total memory available on
the node and the number of CPUs on the node (and perhaps
job_min_memory and (job_min_memory & MEM_PER_CPU)), then it sounds fine to me.

Don

matthieu hautreux

Dec 7, 2009, 4:14:56 PM
to slur...@lists.llnl.gov
Don,

thanks for your reply. My problem is that I would like to add memory constraints to the jobs/steps I launch inside the proctrack/cgroup.

To do that I'm using the job_mem member of slurmd_job_t, but it does not correspond to the amount of memory the job is supposed to use; it corresponds to the amount of memory the step is allowed to use.

The directory hierarchy of the generated cgroups is the following:

/dev/cgroup/slurm/uid_500/job_543/step_1/memory.limit_in_bytes
/dev/cgroup/slurm/uid_500/job_543/step_0/memory.limit_in_bytes
/dev/cgroup/slurm/uid_500/job_543/memory.limit_in_bytes
/dev/cgroup/slurm/uid_500/memory.limit_in_bytes

step_0 is a backgrounded step that is allowed to use all of the job memory (600 MB per core). step_1 is only allowed to use 100 MB per core. As a result, when step_1 is started, proctrack/cgroup sets the /dev/cgroup/slurm/uid_500/job_543/ memory limit to 100*ncores, preventing step_0 from using what it requested.

Here are the commands used to launch the steps:

[mat@leaf ~]$ salloc --mem-per-cpu 600 -n 2
salloc: Granted job allocation 543
[mat@leaf ~]$ srun sleep 300 &
[1] 3505
[mat@leaf ~]$ srun --mem-per-cpu 100 sleep 300 &
[2] 3521


A workaround could be to first read the limit that is already set and modify it only if the new constraint is larger. However, this shows that a jobstep_mem field would also be interesting to have in slurmd_job_t: I could then use job_mem to set the job container memory constraint and jobstep_mem to set the step memory constraint. As --exclusive can now be used to consume only a subset of the allocated resources, this could be interesting.
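
A minimal sketch of that workaround, using nothing but plain reads and writes of the cgroup memory.limit_in_bytes file (the function name and error handling are illustrative, and it assumes an initial limit was written when the job cgroup was created):

#include <stdio.h>

/* Raise the limit stored in a cgroup memory.limit_in_bytes file, but
 * never lower it.  Returns 0 on success, -1 on error. */
static int raise_mem_limit(const char *limit_file, unsigned long long new_bytes)
{
    unsigned long long cur = 0;
    FILE *fp = fopen(limit_file, "r");

    if (fp == NULL)
        return -1;
    if (fscanf(fp, "%llu", &cur) != 1)
        cur = 0;
    fclose(fp);

    if (new_bytes <= cur)      /* existing constraint is already larger: keep it */
        return 0;

    fp = fopen(limit_file, "w");
    if (fp == NULL)
        return -1;
    fprintf(fp, "%llu\n", new_bytes);
    fclose(fp);
    return 0;
}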

I'm not requesting anything, nor starting to add it myself; I just want to get your feelings about it. The same logic could also be applied to alloc_cores, if slurmctld has to select and specify step cores for exclusive usage.

Regards,
Matthieu


2009/12/5 Don Lipari <Lip...@llnl.gov>

Mark A. Grondona

Dec 7, 2009, 4:35:31 PM
to slur...@lists.llnl.gov

I might not be understanding your concern, but if there is no other
job step running under the job itself, then there doesn't seem to
be any benefit to setting the *real* job memory constraint (e.g. to
more than the individual job step constraint). It seems
like it is good enough (and in fact required) to ensure on each
job step launch that "job memory constraint" == MAX(job step memory constraint).

However, it certainly wouldn't *hurt* to have this information in
each job step. So I wouldn't be against it.

matthieu hautreux

Dec 7, 2009, 6:02:26 PM
to slur...@lists.llnl.gov


2009/12/7 Mark A. Grondona <mgro...@llnl.gov>

My problem is to handle generic scenarios where multiple simultaneous steps are launched inside an allocation with --exclusive usage. The only information concerning memory that I have on the slurmstepd side is job_mem in slurmd_job_t. The value of this parameter depends on how the steps are started: if no particular amount of memory is specified, the amount allowed for the job is used; if a particular amount is requested, that step value is used instead.

Thus, if I just set a step limit using slurmd_job_t.job_mem, the whole job will be able to use a multiple of its own limit:

/dev/cgroup/slurm/uid_500/job_543/mem_limit=NOLIMIT
/dev/cgroup/slurm/uid_500/job_543/step_0/mem_limit=600MB
/dev/cgroup/slurm/uid_500/job_543/step_1/mem_limit=600MB

So it is necessary to set the job memory limit to cap the allowed consumption:

/dev/cgroup/slurm/uid_500/job_543/mem_limit=600MB
/dev/cgroup/slurm/uid_500/job_543/step_0/mem_limit=NOLIMIT
/dev/cgroup/slurm/uid_500/job_543/step_1/mem_limit=NOLIMIT

This works well. But if a third simultaneous step is started with a particular memory request of 100 MB, the job_mem value becomes 100 MB and the current proctrack/cgroup logic replaces the job_543 mem_limit with 100 MB, potentially killing the already running steps:

/dev/cgroup/slurm/uid_500/job_543/mem_limit=100MB
/dev/cgroup/slurm/uid_500/job_543/step_0/mem_limit=NOLIMIT
/dev/cgroup/slurm/uid_500/job_543/step_1/mem_limit=NOLIMIT
/dev/cgroup/slurm/uid_500/job_543/step_2/mem_limit=NOLIMIT

As I said, a workaround would be to alter the job_543 mem_limit only when the new value is higher; then I would still have:

/dev/cgroup/slurm/uid_500/job_543/mem_limit=600MB
/dev/cgroup/slurm/uid_500/job_543/step_0/mem_limit=NOLIMIT
/dev/cgroup/slurm/uid_500/job_543/step_1/mem_limit=NOLIMIT
/dev/cgroup/slurm/uid_500/job_543/step_2/mem_limit=NOLIMIT

I could also combine the two approaches and have:

/dev/cgroup/slurm/uid_500/job_543/mem_limit=600MB
/dev/cgroup/slurm/uid_500/job_543/step_0/mem_limit=600MB
/dev/cgroup/slurm/uid_500/job_543/step_1/mem_limit=600MB
/dev/cgroup/slurm/uid_500/job_543/step_2/mem_limit=100MB

Thus I can prevent step_2 from consuming more than the requested amount. cgroup memory constraints are hierarchical, which ensures that the total amount of memory used by all the steps will never exceed the limit set on the enclosing job container.

As I said, I can solve my problem by guessing that a higher value for job_mem means that the step is authorized to use this amount and that the corresponding job is authorized to use "at least" this amount too. However, this is only a "best guess" solution, which is why I'm wondering whether a deterministic way to get the job and job step amounts of allowed memory would be better.
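
To make the deterministic variant concrete, here is a rough sketch of how the plugin could apply the two values if both job_mem and the proposed step_mem were available (the helper, the parameter list and the paths are illustrative only, not existing SLURM code):

#include <limits.h>
#include <stdio.h>

/* Write a limit expressed in MB to a memory.limit_in_bytes file. */
static int write_mb_limit(const char *limit_file, unsigned int mb)
{
    FILE *fp = fopen(limit_file, "w");
    if (fp == NULL)
        return -1;
    fprintf(fp, "%llu\n", (unsigned long long) mb * 1024 * 1024);
    fclose(fp);
    return 0;
}

/* job_mem caps the whole job container, the proposed step_mem caps only
 * the step container.  Paths follow the example hierarchy above. */
static int apply_mem_limits(unsigned int uid, unsigned int jobid,
                            unsigned int stepid,
                            unsigned int job_mem_mb, unsigned int step_mem_mb)
{
    char path[PATH_MAX];

    snprintf(path, sizeof(path),
             "/dev/cgroup/slurm/uid_%u/job_%u/memory.limit_in_bytes",
             uid, jobid);
    if (write_mb_limit(path, job_mem_mb) < 0)
        return -1;

    snprintf(path, sizeof(path),
             "/dev/cgroup/slurm/uid_%u/job_%u/step_%u/memory.limit_in_bytes",
             uid, jobid, stepid);
    return write_mb_limit(path, step_mem_mb);
}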

Concerning alloc_cores, the same logic could apply. Currently, the slurmd_job_t alloc_cores parameter is always set to the job's map of allocated cores, regardless of the amount requested by the steps. If SLURM evolves to manage step resource requirements the way it manages job requirements (as it seems to do in pre7), we could greatly benefit from a step_alloc_cores too. Thus we could easily have:

/dev/cgroup/slurm/uid_500/job_543/cpusets=0-3
/dev/cgroup/slurm/uid_500/job_543/step_0/cpusets=0
/dev/cgroup/slurm/uid_500/job_543/step_1/cpusets=1
/dev/cgroup/slurm/uid_500/job_543/step_2/cpusets=2-3

based on information provided directly by slurmctld.
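
A hypothetical step_alloc_cores could then be translated almost directly into the cpuset controller, along these lines (a sketch only; the helper is not existing SLURM code, and the file names are those exposed by the cgroup cpuset controller):

#include <stdio.h>

/* Apply a core list (e.g. the value of a proposed step_alloc_cores field,
 * such as "2-3") to a step cgroup through the cpuset controller.  Both
 * cpuset.cpus and cpuset.mems must be non-empty before tasks can be
 * attached; "0" is an illustrative single-NUMA-node value. */
static int set_step_cpuset(const char *step_dir, const char *cores)
{
    char path[4096];
    FILE *fp;

    snprintf(path, sizeof(path), "%s/cpuset.mems", step_dir);
    if ((fp = fopen(path, "w")) == NULL)
        return -1;
    fprintf(fp, "0\n");
    fclose(fp);

    snprintf(path, sizeof(path), "%s/cpuset.cpus", step_dir);
    if ((fp = fopen(path, "w")) == NULL)
        return -1;
    fprintf(fp, "%s\n", cores);
    fclose(fp);
    return 0;
}

For instance, set_step_cpuset("/dev/cgroup/slurm/uid_500/job_543/step_2", "2-3") would reproduce the step_2 line above.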

IMHO, when slurmctld knows something, it had better tell slurmd/slurmstepd rather than let them guess what they should do by themselves. That could prevent them from making wrong decisions in some cases.

Regards,
Matthieu

Don Lipari

Dec 7, 2009, 6:17:02 PM
to slur...@lists.llnl.gov
Matthieu,

Ignoring implementation details for a moment, under the example you present below, I would think that:

/dev/cgroup/slurm/uid_500/job_543/memory.limit_in_bytes
would contain the total memory stipulated by the salloc: 2 CPUs * 600 MB

/dev/cgroup/slurm/uid_500/job_543/step_0/memory.limit_in_bytes
would contain the limit for the first srun: 1 CPU * 600 MB

/dev/cgroup/slurm/uid_500/job_543/step_1/memory.limit_in_bytes
would contain the limit for the second srun: 1 CPU * 100 MB

It looks like the above scenario cannot be implemented by using the
existing job_mem member of the slurmd_job_t structure, so I'm also OK
with adding step_mem.

Don

Don Lipari

Dec 7, 2009, 6:23:29 PM
to slur...@lists.llnl.gov
Matthieu,

So I read the additional details below as support for adding step_mem to slurmd_job_t, avoiding both the messy workaround and the need to guess. I'm fine with the proposal.

Don

Mark A. Grondona

Dec 7, 2009, 6:57:19 PM
to slur...@lists.llnl.gov

> Concerning alloc_cores, the same logic could be used too. Currently,
> slurmd_job_t alloc_cores parameter is always set to the job map of allocated
> cores regardless the amount requested by the steps. If slurm evolves to
> manage steps resources requirements as it manages job requirements (as it
> seems to do in pre7), we could greatly benefit from a step_alloc_cores too.
> Thus we could easily have :
>
> /dev/cgroup/slurm/uid_500/job_543/cpusets=0-3
> /dev/cgroup/slurm/uid_500/job_543/step_0/cpusets=0
> /dev/cgroup/slurm/uid_500/job_543/step_1/cpusets=1
> /dev/cgroup/slurm/uid_500/job_543/step_2/cpusets=2-3

Ok, I didn't even realize that alloc_cores only applied to the job
and not the job step.

I am in total agreement with your statement below. If SLURM is assigning
something to job steps, I don't know why it wouldn't send that information
to slurmd/slurmstepd so it could do something useful with it.

Hopefully this will be considered a bug and will be fixed in 2.1 asap.

BTW, if this is fixed we'll need new calls in spank too to get the
JOB versus the STEP constraints.

Thanks
mark


matthieu hautreux

Dec 9, 2009, 4:31:04 PM
to slur...@lists.llnl.gov
Thank you, Don.

So what is the next step? Do you think you will be able to add it, or are you OK with the concept but would prefer to let me do the work? I do not have much time right now, but I could put it on my SLURM todo list for 2010 if necessary.

Thanks again,
Matthieu


2009/12/8 Don Lipari <Lip...@llnl.gov>

Don Lipari

Dec 9, 2009, 4:43:58 PM
to slur...@lists.llnl.gov
Matthieu,

We're trying to wind down feature enhancements to SLURM 2.1 in preparation for a formal release within a month or so. While adding a couple of new members to slurmd_job_t is innocuous enough, finding the right places to make the assignments will require some study.

So, we will consider adding it to 2.1 if time permits. If we can't do it within a month or two, you're welcome to take on the task.

Thank you,
Don

matthieu hautreux

Dec 9, 2009, 4:59:46 PM
to slur...@lists.llnl.gov
OK, let me know in a few weeks if you need my help with that. I will implement the discussed workaround in my plugin in the meantime.

Thanks,
Matthieu

2009/12/9 Don Lipari <Lip...@llnl.gov>