[slurm-dev] Fwd: Scheduling jobs according to the CPU load


kesim

Mar 16, 2017, 12:55:11 PM
to slurm-dev

---------- Forwarded message ----------
From: kesim <keti...@gmail.com>
Date: Thu, Mar 16, 2017 at 5:50 PM
Subject: Scheduling jobs according to the CPU load
To: slur...@schedmd.com


Hi all,

I am a new user and I created a small network of 11 nodes (7 CPUs per node) out of users' desktops.
I configured slurm as:
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
When I submit a task with srun -n70 task, it fills 10 nodes with 7 tasks/node. However, I have no clue what algorithm is used to choose the nodes. Users run programs on the nodes and some nodes are busier than others. It seems logical that the scheduler should submit the tasks to the less busy nodes, but that is not the case.
In sinfo -N -o '%N %O %C' I can see that jobs are allocated to node11 with load 2.06, leaving node4 totally idle. That somehow makes no sense to me.
NODELIST   CPU_LOAD   CPUS(A/I/O/T)
node1      0.00       7/0/0/7
node2      0.26       7/0/0/7
node3      0.54       7/0/0/7
node4      0.07       0/7/0/7
node5      0.00       7/0/0/7
node6      0.01       7/0/0/7
node7      0.00       7/0/0/7
node8      0.01       7/0/0/7
node9      0.06       7/0/0/7
node10     0.11       7/0/0/7
node11     2.06       7/0/0/7
How can I configure slurm to fill the nodes with the minimum load first?


Paul Edmon

Mar 16, 2017, 1:25:31 PM
to slurm-dev

You should look at LLN (least loaded nodes):

https://slurm.schedmd.com/slurm.conf.html

That should do what you want.

-Paul Edmon-
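
A minimal sketch of what Paul is pointing at, as slurm.conf lines. CR_LLN applies least-loaded-node selection globally; the slurm.conf man page also documents a per-partition LLN=YES flag, shown here with placeholder partition and node names:

SelectType=select/cons_res
SelectTypeParameters=CR_CPU,CR_LLN
# or, per partition only:
PartitionName=desktops Nodes=node[1-11] LLN=YES Default=YES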

kesim

Mar 16, 2017, 2:54:04 PM
to slurm-dev
Thank you for the great suggestion. It is working! However, the description of CR_LLN is misleading: "Schedule resources to jobs on the least loaded nodes (based upon the number of idle CPUs)". I understood it to mean that if two nodes do not have all their CPUs allocated, the node with the smaller number of allocated CPUs takes precedence. Therefore the bracketed comment should be removed from the description.

kesim

Mar 17, 2017, 5:33:32 AM
to slurm-dev
Dear All,
Yesterday I did some tests and it seemed that the scheduling follows the CPU load, but I was wrong.
My configuration is at the moment:
SelectType=select/cons_res
SelectTypeParameters=CR_CPU,CR_LLN

Today I submitted 70 threaded jobs to the queue and here is the CPU_LOAD info:
NODELIST   CPU_LOAD   CPUS(A/I/O/T)
node1      0.08       7/0/0/7
node2      0.01       7/0/0/7
node3      0.00       7/0/0/7
node4      2.97       7/0/0/7
node5      0.00       7/0/0/7
node6      0.01       7/0/0/7
node7      0.00       7/0/0/7
node8      0.05       7/0/0/7
node9      0.07       7/0/0/7
node10     0.38       7/0/0/7
node11     0.01       0/7/0/7
As you can see, it allocated 7 CPUs on node4, which has CPU_LOAD 2.97, and 0 CPUs on the idling node11. Why is such a simple thing not the default? What am I missing?

kesim

Mar 18, 2017, 12:17:13 PM
to slurm-dev
Unbelievable, but it seems that nobody knows how to do that. It is astonishing that such a sophisticated system fails at such a simple problem. Slurm is aware of the CPU load from non-slurm jobs, but it does not use the information. My original understanding of LLN was apparently correct. I can practically kill the CPUs on a particular node with non-slurm tasks, yet slurm will diligently submit 7 jobs to this node while leaving others idle. I consider this a serious bug in the program.

John Hearns

Mar 18, 2017, 12:43:30 PM
to slurm-dev

Kesim,

What you are saying is that Slurm schedules tasks based on the number of allocated CPUs, rather than the actual load factor on the server.
As I recall, Gridengine actually used the load factor.

However, you comment that "users run programs on the nodes" and that "slurm is aware about the load of non-slurm jobs".
IMHO, in any well-run HPC setup any user running jobs without going through the scheduler would have their fingers broken, or at least bruised with the clue stick.

Seriously, three points:

a) Tell users to use 'salloc' and 'srun' to run interactive jobs. They can easily open a Bash session on a compute node and do what they like, under the Slurm scheduler.

b) Implement the pam_slurm PAM module. It is a few minutes' work. This means your users cannot go behind the Slurm scheduler and log into the nodes.

c) On Bright clusters, which I configure, you have a healthcheck running which warns you when a user is detected logging in without using Slurm.


Seriously again: you have implemented an HPC infrastructure and have gone to the time and effort of implementing a batch scheduling system.
A batch scheduler can be adapted to let your users do their jobs, including interactive shell sessions and remote visualization sessions.
Do not let the users ride roughshod over you.
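
A minimal sketch of points (a) and (b) above. The exact commands and PAM configuration vary by distribution, and the module name (pam_slurm versus the newer pam_slurm_adopt) depends on how Slurm was packaged, so treat this as illustrative only:

# (a) Interactive work under Slurm: allocate resources, then open a shell
#     on the allocated node instead of ssh-ing to it directly.
salloc -n1 --time=02:00:00
srun --pty bash -l

# (b) Only allow SSH logins from users who have a job running on the node.
#     Run as root on each compute node; adjust the PAM stack file as needed.
echo 'account    required    pam_slurm.so' >> /etc/pam.d/sshd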


kesim

Mar 18, 2017, 2:07:17 PM
to slurm-dev
Dear John,

Thank you for your answer. Obviously you are right that I could run everything through slurm and thus avoid the issue, and your points are taken. However, I still insist that it is a serious bug not to take the actual CPU load into account when the scheduler submits a job, regardless of whose fault it is that a non-slurm job is running. I would not expect that from even the simplest scheduler, and if I had had such prior knowledge I would not have invested so much time and effort in setting up slurm.
Best regards,

Ketiw 


John Hearns

Mar 18, 2017, 3:06:50 PM
to slurm-dev

Kesim,
Touché, sir. I agree with you.


TO_Webmaster

Mar 19, 2017, 6:27:13 AM
to slurm-dev

Please remember that might lead to a huge waste of resources. Imagine you have a cluster with 10 nodes with 10 cores each. Then somebody submits 10 jobs requesting 1 core per job. If I understand you correctly, you would like to see one job per node then? Now imagine someone else submits 9 jobs requesting the nodes exclusively. Then none of these 9 jobs can start, because there is one job with one core on each node. If the former 10 jobs had been packed onto one node, all of the latter 9 jobs could have started immediately.

kesim

Mar 19, 2017, 8:25:37 AM
to slurm-dev
I have 11 nodes and declared 7 CPUs per node. My setup is such that all the desktops belong to group members who use them mainly as graphics stations, so from time to time an application requests high CPU usage; Firefox can do it easily. We also have applications compiled with Intel MPI, and the whole setup is mainly for them.

I would like my scheduler to fully fill nodes with tasks, but starting from the idling nodes. Say I looked at the CPU load of my nodes (sinfo -N -o '%N %O %C' will do that) and, since 2 nodes have a load of ~2 (which usually means more or less that 2 processors are at 100%), I want to use 73 instead of the 77 available processors. My simple-minded understanding is that on a node with CPU load ~2 those two processors should be the last to be allocated, even though they are technically available. This is what a scheduler should do on its own, without my intervention. Sadly, that is not what happens: if I request 73 processors, the scheduler does not take the real CPU load into account and fills the nodes alphabetically. Since sinfo is aware of the CPU load, slurm should take it into account when filling nodes, and it is a serious bug that it does not.
I use slurm 17.02.1-2 in the Ubuntu 16.04 environment.


Will French

Mar 19, 2017, 9:57:31 AM
to slurm-dev
Just because the scheduler does not do what you want or expect by default does not make this a bug. A bug would imply some unexpected behavior due to an error or unanticipated condition within the SLURM source code. I can’t speak for the developers, but it might be that this default behavior you keep referring to as a “bug” was an intentional design decision for efficiency reasons. Job scheduling is an incredibly complex task and by almost all metrics SLURM is currently the most efficient. 

Also consider that SLURM was designed for massive HPC environments and that your setup is a significant departure from this. It is not unreasonable at all that you need to alter the default configuration of SLURM in order to run in a setup involving workstations, interactive use, processes unmanaged by SLURM, etc. I suspect that is a pretty massive departure from the use case SLURM was targeting when it was initially developed.    

Will

kesim

Mar 19, 2017, 10:38:05 AM
to slurm-dev
Dear Will,

I am not trying to diminish the value of slurm here. I only want to find the solution to a trivial problem. I also think that slurm was designed for HPC and performs well in such environments. I agree with you that my environment hardly qualifies as HPC, but still, one of the simplest concepts behind any scheduler is not to overload some nodes while others are idling; can that really be by design? I cannot speak for the developers either, but it probably needs only a few lines of code to add this feature, considering that the data is already collected. As far as I understand there is no default slurm installation; you have to adapt it to your environment, and it is quite flexible. I tried a lot, but unfortunately I failed to achieve my simple goal.

Best regards,

Ketiw

Benjamin Redling

Mar 19, 2017, 11:09:24 AM
to slurm-dev

On 19.03.2017 at 15:36, kesim wrote:
> ... I only want to find the solution to a trivial problem. I also think
> that slurm was designed for HPC and performs well in such environments.
> I agree with you that my environment hardly qualifies as HPC, but still,
> one of the simplest concepts behind any scheduler is not to overload
> some nodes while others are idling; can that really be by design? I
> cannot speak for the developers either, but it probably needs only a few
> lines of code to add this feature, considering that the data is already
> collected.

(A lot of [rarely used] features might be just a few extra lines of code away [that nobody contributes, or even pays for].)

If you want to utilize the resources of desktops you might want to have
a look at HTCondor.

BR
--
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
vox: +49 3641 9 44323 | fax: +49 3641 9 44321

Andy Riebs

Mar 19, 2017, 11:13:10 AM
to slurm-dev
Ketiw,

Slurm is really good at the incredibly complex job of managing multi-node (tens, hundreds, thousands, ...) workloads where thousands or hundreds of thousands of cooperating threads expect to be able to correspond with each other within microseconds. To allow random, unmanaged-by-Slurm user programs to use some of the available cycles would break many HPC workloads. (The Linux HPC community has spent years just figuring out how to minimize the impact of predictable and well-known system services!)

For your case, I wonder if something like the Sun Grid Engine or the Open Grid Scheduler might be more appropriate.

Andy
-- 
Andy Riebs
andy....@hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
    May the source be with you!

Christopher Samuel

Mar 20, 2017, 12:38:57 AM
to slurm-dev

On 19/03/17 23:25, kesim wrote:

> I have 11 nodes and declared 7 CPUs per node. My setup is such that all
> desktop belongs to group members who are using them mainly as graphics
> stations. Therefore from time to time an application is requesting high
> CPU usage.

In this case I would suggest you carve off 3 cores via cgroups for
interactive users and give Slurm the other 7 to parcel out to jobs, by
ensuring that Slurm starts within a cgroup dedicated to those 7 cores.

This is similar to the "boot CPU set" concept that SGI came up with (at
least I've not come across people doing that before them).

To be fair, this is not really Slurm's problem to solve; Linux already
gives you the tools to do this, it's just that people don't realise that
you can use cgroups for it.

Your use case is valid, but it isn't really HPC, and you can't really
blame Slurm for not catering to this. It can use cgroups to partition
cores to jobs precisely so it doesn't need to care what the load average
is - it knows the kernel is ensuring the cores the jobs want are not
being stomped on by other tasks.

Best of luck!
Chris
--
Christopher Samuel Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
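
A rough sketch of the carve-off Chris describes, using a cgroup v1 cpuset. The mount point, core numbers, and slurmd path are assumptions, and systemd-based nodes would instead pin slurmd with unit-file CPU affinity settings:

#!/bin/sh
# Reserve cores 0-2 for interactive desktop use and give cores 3-9 to Slurm
# (core numbering is illustrative; adjust to the actual machine).
mkdir -p /sys/fs/cgroup/cpuset/slurm
echo 3-9 > /sys/fs/cgroup/cpuset/slurm/cpuset.cpus
echo 0   > /sys/fs/cgroup/cpuset/slurm/cpuset.mems
# Move this shell into the cpuset and start slurmd from it, so every job
# step launched by slurmd inherits the restricted core set.
echo $$ > /sys/fs/cgroup/cpuset/slurm/cgroup.procs
exec /usr/sbin/slurmd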

Markus Koeberl

Mar 20, 2017, 10:32:32 AM
to slurm-dev, Christopher Samuel


You could additionally define a higher "Weight" value for a host if you know that the load is usually higher on it than on the others.


regards
Markus Köberl
--
Markus Koeberl
Graz University of Technology
Signal Processing and Speech Communication Laboratory
E-mail: markus....@tugraz.at
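
For reference, a sketch of the static weighting Markus describes, as a slurm.conf fragment; node names, CPU counts, and weight values are placeholders. Nodes with the lowest weight are allocated first, so a habitually busy desktop gets a higher weight:

# slurm.conf fragment
NodeName=node[1-10] CPUs=7 Weight=1
NodeName=node11     CPUs=7 Weight=10   # usually loaded by its owner; pick it last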

kesim

Mar 21, 2017, 3:42:37 PM
to slurm-dev
Dear SLURM Users,

My response here is for those who are trying to solve the simple problem of ordering nodes according to the CPU load. Actually, Markus was right and he gave me the idea (THANKS!!!).
The solution is not pretty, but it works and it has a lot of flexibility. Just put a script like this into cron:
 
#!/bin/sh
# Recompute this node's scheduling weight from the load average reported
# by uptime, scaled by 100 and truncated to an integer.
scontrol update node=your_node_name WEIGHT=`echo 100*$(uptime | awk -F'[, ]' '{print $21}')/1 | bc`

Best Regards,

Ketiw



kesim

Mar 21, 2017, 4:17:18 PM
to slurm-dev
There is an error in the script. It should be:

scontrol update node=your_node_name WEIGHT=`echo 100*$(uptime | awk '{print $12}')/1 | bc`

Benjamin Redling

Mar 21, 2017, 5:05:31 PM
to slurm-dev

Hi,

if you don't want to depend on the whitespace in the output of "uptime"
(the number of fields depends on the locale), you can improve that with "awk
'{print $3}' /proc/loadavg" (for the 15-min average); it's always better to
avoid programmatically parsing output made for humans for as long as possible.

Nice hack anyway!

Regards,
Benjamin

Benjamin Redling

Mar 21, 2017, 5:25:20 PM
to slurm-dev

re hi,

your script will occasionally fail because the number of fields in the
output of "uptime" is variable. I was reminded of it by this one:
http://stackoverflow.com/questions/11735211/get-last-five-minutes-load-average-using-ksh-with-uptime

Even more reason to use /proc...

Regards,
Benjamin
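
Combining Benjamin's suggestion with kesim's cron job, a possible locale-independent version; it assumes, as before, that updating Weight via scontrol is acceptable, and that the short hostname matches the Slurm node name:

#!/bin/sh
# Recompute this node's scheduling weight from the 15-minute load average,
# read from field 3 of /proc/loadavg and scaled to an integer for scontrol.
LOAD15=$(awk '{print $3}' /proc/loadavg)
WEIGHT=$(echo "$LOAD15 * 100 / 1" | bc)
scontrol update NodeName=$(hostname -s) Weight=$WEIGHT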


kesim

Mar 21, 2017, 5:35:14 PM
to slurm-dev
You are right. Many thanks for correcting.

Christopher Samuel

Mar 22, 2017, 12:10:54 AM
to slurm-dev

On 22/03/17 08:35, kesim wrote:

> You are right. Many thanks for correcting.

Just note that load average is not necessarily the same as CPU load.

If you have tasks blocked for I/O they will contribute to load average
but will not be using much CPU at all.

So, for instance, on one of our compute nodes a Slurm job can ask for 1
core and start 100 tasks doing heavy I/O; they all share that single core
and push the load average to 100, but the other 31 cores on the node are
idle and can quite safely be used for HPC work.

The manual page for "uptime" on RHEL7 describes it thus:

# System load averages is the average number of processes that
# are either in a runnable or uninterruptable state. A process
# in a runnable state is either using the CPU or waiting to use
# the CPU. A process in uninterruptable state is waiting for
# some I/O access, eg waiting for disk.

All the best,

kesim

Mar 22, 2017, 3:53:16 AM
to slurm-dev
Yes, I agree. However, as I pointed out in my previous emails, the whole exercise is not to restrict the nodes but to order them. If everything else is equal, submitting jobs to idling nodes first makes much more sense than submitting them to busy ones, even if "busy" means only I/O operations (by the way, in my experience even I/O operations slow down MPI calculations). It will then be the user's choice, looking at the CPU load of the system, to decide how many processors to request for her job.

Best Regards

kesim

Mar 22, 2017, 5:05:57 AM
to slurm-dev
Dear All,
I discovered another oddity, this time about WEIGHT. The manual states: "Note that if a job allocation request can not be satisfied using the nodes with the lowest weight, the set of nodes with the next lowest weight is added to the set of nodes under consideration."
However, that is not exactly what happens. In the example below I requested 66 CPUs out of the 77 available on 11 weighted nodes:
NODELIST   CPU_LOAD   CPUS(A/I/O/T)   WEIGHT
node1      0.12       7/0/0/7         8
node2      0.08       7/0/0/7         1
node3      0.00       7/0/0/7         0
node4      0.08       7/0/0/7         9
node5      0.00       3/4/0/7         0
node6      0.30       0/7/0/7         33
node7      0.00       7/0/0/7         0
node8      0.04       7/0/0/7         5
node9      0.00       7/0/0/7         0
node10     0.00       7/0/0/7         0
node11     0.01       7/0/0/7         22
What happened is that slurm skipped the highest-weighted node6, as it should, but then it allocated 7 CPUs on the next-highest-weighted node11 and only 3 CPUs on the least-weighted node5. Is that also by design?

Best regards,

