Hi,
in the cluster where I'm deploying Slurm, job allocation has to be based on the actual free memory available on each node, not just on the memory Slurm has allocated. This is non-negotiable, and I understand it's not how Slurm is designed to work, but I'm trying anyway.
Among the solutions I'm envisaging:
1) Periodically create and update a pseudo-numerical node feature, encoding the value in a string with a separator (e.g. memfree_2048). This definitely seems messy to implement and too hacky, but is there an equivalent in Slurm to PBS's numerical complexes and sensors? (A rough sketch of what I have in mind follows the list.)
2) Modifying the select/cons_res plugin to compare against the actual free memory instead of the allocated memory. Is it as simple as editing the "_add_job_to_res" function (https://github.com/SchedMD/slurm/blob/master/src/plugins/select/cons_res/select_cons_res.c#L816) and using the real remaining memory? I don't want to break anything else, so that's my main question here; any guidance towards a solution, or other thoughts on its feasibility, would be welcome. (A second sketch follows below.)
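A rough sketch of the feature-update idea, in case it helps the discussion. It assumes a cron job (or node health check) runs this on every compute node, that scontrol accepts runtime feature updates with "scontrol update NodeName=... Features=...", and that the value is bucketed (rounded down to 2 GB steps here) so jobs can request a predictable string with --constraint. The helper and the whole workflow are only assumptions on my part, not something Slurm provides:

/* Hypothetical per-node updater: advertise bucketed free memory as a
 * node feature such as memfree_2048.  Run periodically from cron.
 * Features set with scontrol may not survive an slurmctld restart
 * unless they are also listed in slurm.conf. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BUCKET_MB 2048L    /* round free memory down to 2 GB steps */

/* Read MemAvailable (in kB) from /proc/meminfo and return it in MB. */
static long free_mem_mb(void)
{
    FILE *fp = fopen("/proc/meminfo", "r");
    char line[256];
    long kb = -1;

    if (!fp)
        return -1;
    while (fgets(line, sizeof(line), fp)) {
        if (sscanf(line, "MemAvailable: %ld kB", &kb) == 1)
            break;
    }
    fclose(fp);
    return (kb > 0) ? kb / 1024 : -1;
}

int main(void)
{
    char host[256];
    char cmd[512];
    long mb = free_mem_mb();
    long bucket;

    if (mb < 0 || gethostname(host, sizeof(host)) != 0)
        return 1;

    bucket = (mb / BUCKET_MB) * BUCKET_MB;
    snprintf(cmd, sizeof(cmd),
             "scontrol update NodeName=%s Features=memfree_%ld",
             host, bucket);
    return system(cmd) == 0 ? 0 : 1;
}

A job would then ask for something like --constraint=memfree_32768, which is exactly the string-matching hack I'd rather avoid.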
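And for option 2, just to make the intent concrete, here is an illustrative sketch of the kind of comparison that would change. This is not the real cons_res code and all field and function names are hypothetical; the idea is to test the request against the free-memory figure the node reports (the FreeMem value visible in "scontrol show node", if I'm reading it right) instead of against Slurm's own allocation book-keeping:

/* Illustrative only -- not actual plugin code.  Shows the difference
 * between fitting a job by Slurm's allocation accounting and fitting
 * it by a measured free-memory sample. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct node_state {
    uint64_t real_memory_mb;   /* RealMemory from slurm.conf */
    uint64_t alloc_memory_mb;  /* memory Slurm has already handed out */
    uint64_t free_memory_mb;   /* last free-memory sample from the node */
};

/* Stock behaviour: fit the request against Slurm's book-keeping. */
static bool fits_by_allocation(const struct node_state *n, uint64_t req_mb)
{
    return req_mb <= n->real_memory_mb - n->alloc_memory_mb;
}

/* Modified behaviour: fit the request against measured free memory.
 * This is a point-in-time sample, so it can overcommit the node if an
 * already-running job grows after the decision is made. */
static bool fits_by_measured_free(const struct node_state *n, uint64_t req_mb)
{
    return req_mb <= n->free_memory_mb;
}

int main(void)
{
    /* 128G node: Slurm has allocated 64G, but the running jobs have
     * grown and only 16G is actually free. */
    struct node_state n = { 131072, 65536, 16384 };
    uint64_t req = 32768;   /* a 32G request */

    printf("fits by allocation:    %d\n", fits_by_allocation(&n, req));
    printf("fits by measured free: %d\n", fits_by_measured_free(&n, req));
    return 0;
}

Whether such a change belongs in "_add_job_to_res" or in the node-selection test elsewhere in the plugin is exactly what I'm unsure about.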
Thanks a lot in advance!
Best regards,
Alexandre
Hello John, this behavior is needed because the memory usage of the codes executed on the nodes is particularly hard to guess. When the request is exceeded, actual usage is usually 1.1 to 1.3 times what was expected, sometimes much larger.
A) Indeed there is a partition running only exclusive jobs, but a large number of nodes is also needed for non-exclusive allocations. That's why the exact amount of available memory is required in this configuration. Tasks are not killed if they use more than allocated.
B) Yes, cgroup is currently configured and working as expected (I believe), but as I said, tasks need to be able to grow beyond their request.
Oftentimes users only request 1G because they really have no idea of the memory requirements, and with the high demand for HPC time they set a low memory requirement so the job will start.
So a job must not be started on a node where another job is filling up the RAM; it should start on another node instead.
Would this behavior cause problems in the scheduling/allocation algorithms? The way I see it, the actual free memory would just be another consumable resource.
But the only way I can see this working is by tweaking the plugin, correct?
Thank you for your input.
From: slurm-users [mailto:slurm-use...@lists.schedmd.com]
On behalf of John Hearns
Sent: Tuesday, 29 May 2018 12:39
To: Slurm User Community List
Subject: Re: [slurm-users] Using free memory available when allocating a node to a job
Thanks for your inputs; the automatic reporting is definitely a great idea and seems easy to implement with Slurm. At our site we have an internally developed web portal where users can see in real time everything happening on the cluster, and every metric of their own jobs. In particular, there is a color code showing under/overestimation of memory allocation.
We have constraints: we cannot afford to lose time killing jobs, or to lose performance when a 16G job lands on a node with only 4G left.
In PBS, taking the actual free memory into account as a resource for allocation is a great way to handle this. I find it a shame not to use Slurm's allocation algorithms and instead develop another, hacky one with "numerical features" per node.
I'll admit I'm not comfortable enough editing the cons_res plugin source code, but there doesn't seem to be another way around this need.
Regards,
Alexandre
From: slurm-users [mailto:slurm-use...@lists.schedmd.com]
On behalf of John Hearns
Sent: Tuesday, 29 May 2018 13:16
To: Slurm User Community List
Subject: Re: [slurm-users] Using free memory available when allocating a node to a job
Alexandre, you have made a very good point here: "Oftentimes users only input 1G as they really have no idea of the memory requirements."
At my last job we introduced cgroups (this was in PBSPro). We had to enforce a minimum memory request.
Users then asked us how much memory their jobs used, so that next time they could request an amount of memory that would let the job run to completion.
We were manually giving users information on how much memory their jobs used.
I realise that the tools are there for users to get information on memory usage after a job, but I really do not expect users to have to figure this out.
What do other sites do in this case?
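For what it's worth, the report we used to produce by hand could be automated along these lines. This is only a sketch of the idea: the sacct options are standard, but the wrapper itself (and calling it from an epilog or a user-facing command) is just an assumption about how a site might wire it up:

/* Hypothetical post-job helper: print requested vs. measured memory for
 * a finished job so users do not have to dig through sacct themselves.
 * Assumes job accounting is enabled and sacct is in PATH. */
#include <stdio.h>

static int print_memory_report(const char *jobid)
{
    char cmd[256];
    char line[512];
    FILE *pipe;

    /* MaxRSS is the peak resident set recorded by the accounting plugin,
     * ReqMem is what the user asked for with --mem / --mem-per-cpu. */
    snprintf(cmd, sizeof(cmd),
             "sacct -j %s --format=JobID,MaxRSS,ReqMem,State --noheader -P",
             jobid);

    pipe = popen(cmd, "r");
    if (!pipe)
        return 1;

    printf("Memory report for job %s (JobID|MaxRSS|ReqMem|State):\n", jobid);
    while (fgets(line, sizeof(line), pipe))
        fputs(line, stdout);

    return pclose(pipe) == 0 ? 0 : 1;
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <jobid>\n", argv[0]);
        return 2;
    }
    return print_memory_report(argv[1]);
}

Users could run it right after a job completes to see how far their --mem request was from reality.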
From: slurm-users [mailto:slurm-use...@lists.schedmd.com] On behalf of John Hearns
One thing that seems concerning to me is that you may start a job on a node before a currently running job has 'expanded' as much as it eventually will.
If there is 128G on the node and the current job is using 64G but will eventually use 112G, your approach could start another similar job and both would end up swapping.
We had always pushed the users to know what they need before they submit a job. They can ask for too much and then go down from there, but it is really their responsibility to know what their program will do. You are giving them the keys to a Tesla and they want to blame you if they put the pedal to the metal and crash. Learn the tools before you use them.
Brian Andrus