SLURM currently supports two alternate memory constraints, per node
and per cpu, but never both together. The guideline is that per-node
memory limits should be used when SelectType is linear, and per-cpu
limits should accompany cons_res.
Throughout the code, a single variable (usually job_min_memory or
job_mem) is used to convey the memory constraint and the MEM_PER_CPU
flag is or'd into the value when the variable conveys a memory-per-cpu limit.
I'm unclear on how you would assign a new step_mem variable in the
slurmd_job_t structure. I would hope that you are not thinking of
adding a new step memory argument to salloc/sbatch/srun. If instead
you derive the step_mem value from the total memory available on the
node and the number of cpus on the node (and perhaps job_min_memory
and its MEM_PER_CPU flag), then it sounds fine to me.
Don
I might not be understanding your concern, but if there is no other
job step running under the job itself, then there doesn't seem to be
any benefit to setting the *real* job memory constraint (e.g. to more
than the individual job step constraint). It seems good enough (and
in fact required) to ensure on each job step launch that
"job memory constraint" == MAX(job step memory constraints)
However, it certainly wouldn't *hurt* to have this information in
each job step. So I wouldn't be against it.
Ignoring implementation details for a moment, under the example you
present below, I would think that
/dev/cgroup/slurm/uid_500/job_543/memory.limit_in_bytes
would contain the total memory stipulated by the salloc: 2 cpus * 600M
/dev/cgroup/slurm/uid_500/job_543/step_0/memory.limit_in_bytes
would contain the limit for the first srun: 1 cpu * 600M
/dev/cgroup/slurm/uid_500/job_543/step_1/memory.limit_in_bytes
would contain the limit for the second srun: 1 cpu * 100M
It looks like the above scenario cannot be implemented by using the
existing job_mem member of the slurmd_job_t structure, so I'm also OK
with adding step_mem.
Don
> Concerning alloc_cores, the same logic could be used too. Currently, the
> slurmd_job_t alloc_cores parameter is always set to the job's map of
> allocated cores regardless of the amount requested by the steps. If SLURM
> evolves to manage step resource requirements as it manages job requirements
> (as it seems to do in pre7), we could greatly benefit from a
> step_alloc_cores too. Thus we could easily have:
>
> /dev/cgroup/slurm/uid_500/job_543/cpusets=0-3
> /dev/cgroup/slurm/uid_500/job_543/step_0/cpusets=0
> /dev/cgroup/slurm/uid_500/job_543/step_1/cpusets=1
> /dev/cgroup/slurm/uid_500/job_543/step_2/cpusets=2-3
Ok, I didn't even realize that alloc_cores only applied to the job
and not the job step.
I am in total agreement with your statement below. If SLURM is assigning
something to job steps, I don't know why it wouldn't send that information
to slurmd/slurmstepd so they could do something useful with it.
Hopefully this will be considered a bug and will be fixed in 2.1 asap.
BTW, if this is fixed, we'll need new calls in SPANK too, to get the
JOB versus the STEP constraints.
Thanks
mark
>
> based on information directly provided by slurmctld.
>
> IMHO, when slurmctld knows something, it had better tell
> slurmd/slurmstepd rather than let them guess what they should do by
> themselves. It could prevent them from making wrong decisions in some cases.
>
> Regards,
> Matthieu
>
> > However, it certainly wouldn't *hurt* to have this information in
> > each job step. So I wouldn't be against it.
> >
> >
> > > I'm not requesting anything, nor starting to add it myself; I just want
> > > to get your feelings about it. The same logic could be applied to
> > > alloc_cores too, if slurmctld has to select and specify step cores for
> > > exclusive usage.
>
> 2009/12/7 Mark A. Grondona <mgrondona@llnl.gov>:
>
> > > Don,
> > >
> > > thanks for your reply. My problem is that I would like to add memory
> > > constraints to the jobs/steps I launch inside the proctrack/cgroup.
> > >
> > > To do that I'm using the job_mem member of slurmd_job_t, but it does not
> > > correspond to the amount of memory a job is supposed to use; it is the
> > > amount of memory the step is allowed to use.
> > >
> > > The directory hierarchy of the generated cgroups is the following:
> > >
> > > /dev/cgroup/slurm/uid_500/job_543/step_1/memory.limit_in_bytes
> > > /dev/cgroup/slurm/uid_500/job_543/step_0/memory.limit_in_bytes
> > > /dev/cgroup/slurm/uid_500/job_543/memory.limit_in_bytes
> > > /dev/cgroup/slurm/uid_500/memory.limit_in_bytes
> > >
> > > step_0 is a backgrounded step that is allowed to use all the job memory
> > > (600MB per core). step_1 is only allowed to use 100MB per core. As a
> > > result, when step_1 is started, proctrack/cgroup sets the
> > > /dev/cgroup/slurm/uid_500/job_543/ memory limit to 100*ncore, thus
> > > preventing step_0 from using what it requested.
> > >
> > > Here are the commands used to launch the steps:
> > >
> > > [mat@leaf ~]$ salloc --mem-per-cpu 600 -n 2
> > > salloc: Granted job allocation 543
> > > [mat@leaf ~]$ srun sleep 300 &
> > > [1] 3505
> > > [mat@leaf ~]$ srun --mem-per-cpu 100 sleep 300 &
> > > [2] 3521
> > >
> > > A workaround could be to first read the limit already set and modify it
> > > only if the new constraint is bigger. However, this outlines that a
> > > jobstep_mem member would be interesting to have in slurmd_job_t too.
> > > Then I could use job_mem to set the job container memory constraint and
> > > jobstep_mem to set the step memory constraint. As --exclusive can now
> > > be used to use only a subset of the allocated resources, this could be
> > > interesting.
> >
> > I might not be understanding your concern, but if there is no other
> > job step running under the job itself, then there doesn't seem to be
> > any benefit to setting the *real* job memory constraint (e.g. to more
> > than the individual job step constraint). It seems good enough (and in
> > fact required) to ensure on each job step launch that
> > "job memory constraint" == MAX(job step memory constraints)
>
> My problem is to handle generic scenarios where multiple simultaneous
> steps are launched inside an allocation with --exclusive usage. The only
> information concerning memory I have on the slurmstepd side is job_mem in
> slurmd_job_t. This parameter's value depends on the way the steps are
> started. If no particular amount of memory is specified, the job's
> allowed amount is used. If a particular amount is requested, that step
> value is used instead.
>
> Thus, if I just set a step limit using slurmd_job_t.job_mem, the whole
> job will be able to use multiple times its own limit:
>
> /dev/cgroup/slurm/uid_500/job_543/mem_limit=NOLIMIT
> /dev/cgroup/slurm/uid_500/job_543/step_0/mem_limit=600MB
> /dev/cgroup/slurm/uid_500/job_543/step_1/mem_limit=600MB
>
> So it is necessary to set the job memory limit to cap the allowed
> consumption:
>
> /dev/cgroup/slurm/uid_500/job_543/mem_limit=600MB
> /dev/cgroup/slurm/uid_500/job_543/step_0/mem_limit=NOLIMIT
> /dev/cgroup/slurm/uid_500/job_543/step_1/mem_limit=NOLIMIT
>
> This works well. But if a third simultaneous step is started with a
> particular memory request, say 100MB, job_mem is then 100MB and the
> current proctrack/cgroup logic swaps the job_543 mem_limit to 100MB,
> potentially killing the already running steps:
>
> /dev/cgroup/slurm/uid_500/job_543/mem_limit=100MB
> /dev/cgroup/slurm/uid_500/job_543/step_0/mem_limit=NOLIMIT
> /dev/cgroup/slurm/uid_500/job_543/step_1/mem_limit=NOLIMIT
> /dev/cgroup/slurm/uid_500/job_543/step_2/mem_limit=NOLIMIT
>
> As I said, a workaround would be to alter the job_543 mem_limit only for
> a higher value; then I would still have:
>
> /dev/cgroup/slurm/uid_500/job_543/mem_limit=600MB
> /dev/cgroup/slurm/uid_500/job_543/step_0/mem_limit=NOLIMIT
> /dev/cgroup/slurm/uid_500/job_543/step_1/mem_limit=NOLIMIT
> /dev/cgroup/slurm/uid_500/job_543/step_2/mem_limit=NOLIMIT
>
> I could also combine the two logics and have:
>
> /dev/cgroup/slurm/uid_500/job_543/mem_limit=600MB
> /dev/cgroup/slurm/uid_500/job_543/step_0/mem_limit=600MB
> /dev/cgroup/slurm/uid_500/job_543/step_1/mem_limit=600MB
> /dev/cgroup/slurm/uid_500/job_543/step_2/mem_limit=100MB
>
> Thus I can prevent step_2 from consuming more than the requested amount.
> cgroup memory constraints can be hierarchical, ensuring that the total
> amount of memory used by all the steps will never be higher than the
> limit set on the wrapping job container.
>
> As I said, I can solve my problem by guessing that a higher value for
> job_mem means the step is authorized to use this amount and that the
> corresponding job is authorized to use "at least" this amount too.
> However, this is just a "best guess" solution, and that's the reason why
> I'm wondering whether adding a deterministic way to get the job and job
> step amounts of allowed memory would be better.
>
> Concerning alloc_cores, the same logic could be used too. Currently, the
> slurmd_job_t alloc_cores parameter is always set to the job's map of
> allocated cores regardless of the amount requested by the steps. If SLURM
> evolves to manage step resource requirements as it manages job
> requirements (as it seems to do in pre7), we could greatly benefit from a
> step_alloc_cores too. Thus we could easily have:
>
> /dev/cgroup/slurm/uid_500/job_543/cpusets=0-3
> /dev/cgroup/slurm/uid_500/job_543/step_0/cpusets=0
> /dev/cgroup/slurm/uid_500/job_543/step_1/cpusets=1
> /dev/cgroup/slurm/uid_500/job_543/step_2/cpusets=2-3
>
> based on information directly provided by slurmctld.
>
> IMHO, when slurmctld knows something, it had better tell
> slurmd/slurmstepd rather than let them guess what they should do by
> themselves. It could prevent them from making wrong decisions in some
> cases.
>
> Regards,
> Matthieu
>
> > However, it certainly wouldn't *hurt* to have this information in
> > each job step. So I wouldn't be against it.
> >
> > > I'm not requesting anything, nor starting to add it myself; I just
> > > want to get your feelings about it. The same logic could be applied
> > > to alloc_cores too, if slurmctld has to select and specify step cores
> > > for exclusive usage.