
[PATCH] sched: Improve load balancing in the presence of idle CPUs


Preeti U Murthy

Mar 26, 2015, 4:00:08 AM
When a CPU is kicked to do nohz idle balancing, it wakes up to do load balancing
on itself, followed by load balancing on behalf of idle CPUs. But it may end
up with load after the load balancing attempt on itself. This aborts nohz
idle balancing. As a result several idle CPUs are left without tasks till such a
time that an ILB CPU finds it unfavorable to pull tasks upon itself. This delays
spreading of load across idle CPUs and worse, clutters only a few CPUs with
tasks.

The effect of the above problem was observed on an SMT8 POWER server with 2 levels
of numa domains. Busy loops equal to number of cores were spawned. Since load balancing
on fork/exec is discouraged across numa domains, all busy loops would start on one of
the numa domains. However it was expected that eventually one busy loop would run per
core across all domains due to nohz idle load balancing. But it was observed that
it took as long as 10 seconds to spread the load across numa domains.

Further investigation showed that this was a consequence of the following:

1. An ILB CPU was chosen from the first numa domain to trigger nohz idle load balancing
[Given the experiment, up to 6 CPUs per core could be potentially idle in this domain.]

2. However the ILB CPU would call load_balance() on itself before initiating nohz idle
load balancing.

3. Given cores are SMT8, the ILB CPU had enough opportunities to pull tasks from its
sibling cores to even out load.

4. Now that the ILB CPU was no longer idle, it would abort nohz idle load balancing

As a result the opportunities to spread load across numa domains were lost until such a
time that the cores within the first numa domain had equal number of tasks among themselves.
This is a pretty bad scenario, since the cores within the first numa domain would have
as many as 4 tasks each, while cores in the neighbouring numa domains would all remain idle.

Fix this by checking if a CPU was woken up to do nohz idle load balancing, before it does
load balancing upon itself. This way we allow idle CPUs across the system to do load balancing
which results in quicker spread of load, instead of performing load balancing within the local
sched domain hierarchy of the ILB CPU alone under circumstances such as above.

Signed-off-by: Preeti U Murthy <pre...@linux.vnet.ibm.com>
---

kernel/sched/fair.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bcfe320..95b00d5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7660,14 +7660,13 @@ static void run_rebalance_domains(struct softirq_action *h)
enum cpu_idle_type idle = this_rq->idle_balance ?
CPU_IDLE : CPU_NOT_IDLE;

- rebalance_domains(this_rq, idle);
-
/*
* If this cpu has a pending nohz_balance_kick, then do the
* balancing on behalf of the other idle cpus whose ticks are
* stopped.
*/
nohz_idle_balance(this_rq, idle);
+ rebalance_domains(this_rq, idle);
}

/*

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Peter Zijlstra

Mar 26, 2015, 6:30:07 AM
Please teach your favourite text editor to wrap Changelog text at 72
chars (because git show shows the Changelog indented by 1 tabstop).

For vim I can offer:

autocmd BufNewFile,BufRead *.patch setlocal textwidth=72

> Signed-off-by: Preeti U Murthy <pre...@linux.vnet.ibm.com>
> ---
>
> kernel/sched/fair.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bcfe320..95b00d5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7660,14 +7660,13 @@ static void run_rebalance_domains(struct softirq_action *h)
> enum cpu_idle_type idle = this_rq->idle_balance ?
> CPU_IDLE : CPU_NOT_IDLE;
>
> - rebalance_domains(this_rq, idle);
> -
> /*
> * If this cpu has a pending nohz_balance_kick, then do the
> * balancing on behalf of the other idle cpus whose ticks are
> * stopped.
> */
> nohz_idle_balance(this_rq, idle);
> + rebalance_domains(this_rq, idle);
> }

Might a comment be called for that explains all this -- albeit a little
less verbose?

Preeti U Murthy

Mar 26, 2015, 9:10:06 AM
Signed-off-by: Preeti U Murthy <pre...@linux.vnet.ibm.com>
---
Changes from V1:
1. Added relevant comments
2. Wrapped lines to a fixed width in the changelog

kernel/sched/fair.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bcfe320..8b6d0d5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7660,14 +7660,16 @@ static void run_rebalance_domains(struct softirq_action *h)
enum cpu_idle_type idle = this_rq->idle_balance ?
CPU_IDLE : CPU_NOT_IDLE;

- rebalance_domains(this_rq, idle);
-
/*
* If this cpu has a pending nohz_balance_kick, then do the
* balancing on behalf of the other idle cpus whose ticks are
- * stopped.
+ * stopped. Do nohz_idle_balance *before* rebalance_domains to
+ * give the idle cpus a chance to load balance. Else we may
+ * load balance only within the local sched_domain hierarchy
+ * and abort nohz_idle_balance altogether if we pull some load.
*/
nohz_idle_balance(this_rq, idle);
+ rebalance_domains(this_rq, idle);
}

/*

Jason Low

Mar 26, 2015, 1:10:05 PM
On Thu, 2015-03-26 at 18:32 +0530, Preeti U Murthy wrote:
> kernel/sched/fair.c | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bcfe320..8b6d0d5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7660,14 +7660,16 @@ static void run_rebalance_domains(struct softirq_action *h)
> enum cpu_idle_type idle = this_rq->idle_balance ?
> CPU_IDLE : CPU_NOT_IDLE;
>
> - rebalance_domains(this_rq, idle);
> -
> /*
> * If this cpu has a pending nohz_balance_kick, then do the
> * balancing on behalf of the other idle cpus whose ticks are
> - * stopped.
> + * stopped. Do nohz_idle_balance *before* rebalance_domains to
> + * give the idle cpus a chance to load balance. Else we may
> + * load balance only within the local sched_domain hierarchy
> + * and abort nohz_idle_balance altogether if we pull some load.
> */
> nohz_idle_balance(this_rq, idle);
> + rebalance_domains(this_rq, idle);

Reviewed-by: Jason Low <jason...@hp.com>

Wanpeng Li

Mar 26, 2015, 10:40:06 PM
Hi Preeti,
On Thu, Mar 26, 2015 at 06:32:44PM +0530, Preeti U Murthy wrote:
>
>1. An ILB CPU was chosen from the first numa domain to trigger nohz idle
>load balancing [Given the experiment, upto 6 CPUs per core could be
>potentially idle in this domain.]
>
>2. However the ILB CPU would call load_balance() on itself before
>initiating nohz idle load balancing.
>
>3. Given cores are SMT8, the ILB CPU had enough opportunities to pull
>tasks from its sibling cores to even out load.
>
>4. Now that the ILB CPU was no longer idle, it would abort nohz idle
>load balancing

I don't see nohz idle load balancing being aborted in nohz_idle_balance()
when the ILB CPU is no longer idle; could you explain in more detail?

Regards,
Wanpeng Li

Preeti U Murthy

Mar 27, 2015, 12:40:05 AM
Hi Wanpeng

On 03/27/2015 07:42 AM, Wanpeng Li wrote:
> Hi Preeti,
> On Thu, Mar 26, 2015 at 06:32:44PM +0530, Preeti U Murthy wrote:
>>
>> 1. An ILB CPU was chosen from the first numa domain to trigger nohz idle
>> load balancing [Given the experiment, upto 6 CPUs per core could be
>> potentially idle in this domain.]
>>
>> 2. However the ILB CPU would call load_balance() on itself before
>> initiating nohz idle load balancing.
>>
>> 3. Given cores are SMT8, the ILB CPU had enough opportunities to pull
>> tasks from its sibling cores to even out load.
>>
>> 4. Now that the ILB CPU was no longer idle, it would abort nohz idle
>> load balancing
>
> I don't see abort nohz idle load balancing when ILB CPU was no longer idle
> in nohz_idle_balance(), could you explain more in details?

When the ILB CPU pulls load in rebalance_domains(), its idle state
is set to CPU_NOT_IDLE.

""
idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;

And,

When nohz_idle_balance() is called, the state of idle of ILB CPU is
checked before proceeding with load balancing on idle CPUs.

""

if (idle != CPU_IDLE ||
!test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
goto end;

You see that nohz idle balancing gets aborted.

Regards
Preeti U Murthy

Wanpeng Li

Mar 27, 2015, 12:50:05 AM
Hi Preeti,
On Fri, Mar 27, 2015 at 10:03:21AM +0530, Preeti U Murthy wrote:
>is set to CPU_NOT_IDLE.
>
>""
>idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;
>
>And,
>
>When nohz_idle_balance() is called, the state of idle of ILB CPU is
>checked before proceeding with load balancing on idle CPUs.
>
>""
>
>if (idle != CPU_IDLE ||
> !test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
> goto end;
>

This 'idle' is the one calculated in run_rebalance_domains() before
invoking rebalance_domains(); rebalance_domains() cannot modify this
local variable.

Regards,
Wanpeng Li

Jason Low

Mar 27, 2015, 1:10:05 AM
On Fri, 2015-03-27 at 10:12 +0800, Wanpeng Li wrote:
> Hi Preeti,
> On Thu, Mar 26, 2015 at 06:32:44PM +0530, Preeti U Murthy wrote:
> >
> >1. An ILB CPU was chosen from the first numa domain to trigger nohz idle
> >load balancing [Given the experiment, upto 6 CPUs per core could be
> >potentially idle in this domain.]
> >
> >2. However the ILB CPU would call load_balance() on itself before
> >initiating nohz idle load balancing.
> >
> >3. Given cores are SMT8, the ILB CPU had enough opportunities to pull
> >tasks from its sibling cores to even out load.
> >
> >4. Now that the ILB CPU was no longer idle, it would abort nohz idle
> >load balancing
>
> I don't see abort nohz idle load balancing when ILB CPU was no longer idle
> in nohz_idle_balance(), could you explain more in details?

Hi Wanpeng,

In nohz_idle_balance(), there is a check for need_resched() so if the
cpu has something to run, it should exit nohz_idle_balance(), which may
cause it to not do the idle balancing on the other CPUs.
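For reference, a simplified sketch of the loop in question (paraphrased from the nohz_idle_balance() of kernel/sched/fair.c of that era; not a complete excerpt, and the helper calls inside the loop are condensed):

```c
for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
	if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
		continue;

	/*
	 * If this cpu gets work to do, stop the load balancing
	 * work being done for other cpus. Next load
	 * balancing owner will pick it up.
	 */
	if (need_resched())
		break;

	rebalance_domains(cpu_rq(balance_cpu), CPU_IDLE);
}
```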

Jason Low

Mar 27, 2015, 1:10:07 AM
On Fri, 2015-03-27 at 10:03 +0530, Preeti U Murthy wrote:
> Hi Wanpeng
>
> On 03/27/2015 07:42 AM, Wanpeng Li wrote:
> > Hi Preeti,
> > On Thu, Mar 26, 2015 at 06:32:44PM +0530, Preeti U Murthy wrote:
> >>
> >> 1. An ILB CPU was chosen from the first numa domain to trigger nohz idle
> >> load balancing [Given the experiment, upto 6 CPUs per core could be
> >> potentially idle in this domain.]
> >>
> >> 2. However the ILB CPU would call load_balance() on itself before
> >> initiating nohz idle load balancing.
> >>
> >> 3. Given cores are SMT8, the ILB CPU had enough opportunities to pull
> >> tasks from its sibling cores to even out load.
> >>
> >> 4. Now that the ILB CPU was no longer idle, it would abort nohz idle
> >> load balancing
> >
> > I don't see abort nohz idle load balancing when ILB CPU was no longer idle
> > in nohz_idle_balance(), could you explain more in details?
>
> When the ILB CPU pulls load in rebalance_domains(), its idle state
> is set to CPU_NOT_IDLE.
>
> ""
> idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;

Hi Preeti,

The "idle" variable is a local variable to the rebalance_domains()
function. In that case, that shouldn't have an effect on the idle value
that gets passed to nohz_idle_balance().

Srikar Dronamraju

Mar 27, 2015, 1:50:06 AM
* Jason Low <jason...@hp.com> [2015-03-26 22:07:21]:

> On Fri, 2015-03-27 at 10:12 +0800, Wanpeng Li wrote:
> > Hi Preeti,
> > On Thu, Mar 26, 2015 at 06:32:44PM +0530, Preeti U Murthy wrote:
> > >
> > >1. An ILB CPU was chosen from the first numa domain to trigger nohz idle
> > >load balancing [Given the experiment, upto 6 CPUs per core could be
> > >potentially idle in this domain.]
> > >
> > >2. However the ILB CPU would call load_balance() on itself before
> > >initiating nohz idle load balancing.
> > >
> > >3. Given cores are SMT8, the ILB CPU had enough opportunities to pull
> > >tasks from its sibling cores to even out load.
> > >
> > >4. Now that the ILB CPU was no longer idle, it would abort nohz idle
> > >load balancing
> >
> > I don't see abort nohz idle load balancing when ILB CPU was no longer idle
> > in nohz_idle_balance(), could you explain more in details?
>
> Hi Wanpeng,
>
> In nohz_idle_balance(), there is a check for need_resched() so if the
> cpu has something to run, it should exit nohz_idle_balance(), which may
> cause it to not do the idle balancing on the other CPUs.
>

Yes, the need_resched() in nohz_idle_balance() would exit the
nohz_idle_balance if it has something to run. However I wonder if we
should move the need_resched check out of the for loop. i.e the
need_resched check should probably be there with the idle check.

With the current code, when the ILB CPUs are not free:
- We would be updating nohz.next_balance even though we haven't done
any load balancing.
- We might iterate through nohz.idle_cpus_mask to find balance_cpus.

--
Thanks and Regards
Srikar Dronamraju

Wanpeng Li

Mar 27, 2015, 3:10:05 AM
On Thu, Mar 26, 2015 at 10:07:21PM -0700, Jason Low wrote:
>On Fri, 2015-03-27 at 10:12 +0800, Wanpeng Li wrote:
>> Hi Preeti,
>> On Thu, Mar 26, 2015 at 06:32:44PM +0530, Preeti U Murthy wrote:
>> >
>> >1. An ILB CPU was chosen from the first numa domain to trigger nohz idle
>> >load balancing [Given the experiment, upto 6 CPUs per core could be
>> >potentially idle in this domain.]
>> >
>> >2. However the ILB CPU would call load_balance() on itself before
>> >initiating nohz idle load balancing.
>> >
>> >3. Given cores are SMT8, the ILB CPU had enough opportunities to pull
>> >tasks from its sibling cores to even out load.
>> >
>> >4. Now that the ILB CPU was no longer idle, it would abort nohz idle
>> >load balancing
>>
>> I don't see abort nohz idle load balancing when ILB CPU was no longer idle
>> in nohz_idle_balance(), could you explain more in details?
>
>Hi Wanpeng,
>
>In nohz_idle_balance(), there is a check for need_resched() so if the
>cpu has something to run, it should exit nohz_idle_balance(), which may
>cause it to not do the idle balancing on the other CPUs.

Got it, thanks. ;)

Regards,
Wanpeng Li

Wanpeng Li

Mar 27, 2015, 3:20:06 AM
Hi Srikar,
On Fri, Mar 27, 2015 at 11:09:07AM +0530, Srikar Dronamraju wrote:
>
>Yes, the need_resched() in nohz_idle_balance() would exit the
>nohz_idle_balance if it has something to run. However I wonder if we
>should move the need_resched check out of the for loop. i.e the
>need_resched check should probably be there with the idle check.
>
>With the current code when the ilb cpus are not free:
>- We would be updating the nohz.next_balance even through we havent done
> any load balance.
>- We might iterate thro the nohz.idle_cpus_mask()s to find balance_cpus.

Good idea. The nohz load balance will be delayed, since nohz.next_balance
is updated inappropriately when the current ILB CPU happens to be busy.
I will send a patch to fix it. ;)

Regards,
Wanpeng Li

tip-bot for Preeti U Murthy

Mar 27, 2015, 7:50:07 AM
Commit-ID: d4573c3e1c992668f5dcd57d1c2ced56ae9650b9
Gitweb: http://git.kernel.org/tip/d4573c3e1c992668f5dcd57d1c2ced56ae9650b9
Author: Preeti U Murthy <pre...@linux.vnet.ibm.com>
AuthorDate: Thu, 26 Mar 2015 18:32:44 +0530
Committer: Ingo Molnar <mi...@kernel.org>
CommitDate: Fri, 27 Mar 2015 09:36:09 +0100

sched: Improve load balancing in the presence of idle CPUs
Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Reviewed-by: Jason Low <jason...@hp.com>
Cc: be...@kernel.crashing.org
Cc: daniel....@linaro.org
Cc: efa...@gmx.de
Cc: iamjoon...@lge.com
Cc: morten.r...@arm.com
Cc: p...@google.com
Cc: ri...@redhat.com
Cc: sri...@linux.vnet.ibm.com
Cc: sva...@linux.vnet.ibm.com
Cc: tim.c...@linux.intel.com
Cc: vincent...@linaro.org
Link: http://lkml.kernel.org/r/20150326130014.2...@preeti.in.ibm.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
kernel/sched/fair.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3a798ec..46855d0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7753,14 +7753,16 @@ static void run_rebalance_domains(struct softirq_action *h)

Srikar Dronamraju

Mar 27, 2015, 9:10:05 AM
> When a CPU is kicked to do nohz idle balancing, it wakes up to do load
> balancing on itself, followed by load balancing on behalf of idle CPUs.
> But it may end up with load after the load balancing attempt on itself.
> This aborts nohz idle balancing. As a result several idle CPUs are left
> without tasks till such a time that an ILB CPU finds it unfavorable to
> pull tasks upon itself. This delays spreading of load across idle CPUs
> and worse, clutters only a few CPUs with tasks.
>

[..... snipped .... ]

> Fix this, by checking if a CPU was woken up to do nohz idle load
> balancing, before it does load balancing upon itself. This way we allow
> idle CPUs across the system to do load balancing which results in
> quicker spread of load, instead of performing load balancing within the
> local sched domain hierarchy of the ILB CPU alone under circumstances
> such as above.
>
> Signed-off-by: Preeti U Murthy <pre...@linux.vnet.ibm.com>
> ---


Looks good to me.

Reviewed-by: Srikar Dronamraju <sri...@linux.vnet.ibm.com>

Morten Rasmussen

Mar 27, 2015, 10:40:06 AM
Hi Preeti,
IIUC, this change means that you will always wake up one more cpu than
necessary unless you have enough work for all cpus in the system. For
example, cpu0 is busy with two tasks and cpu1+2 are nohz_idle. cpu0
kicks cpu1 to do a nohz_idle_balance(). With the change it will balance
on behalf of cpu2 first and pull one of the tasks from cpu0. When done
with nohz_idle_balance() cpu1 has nothing left to pull when balancing
itself and goes back to sleep.

My concern is that this will increase the number of cpu wakeups quite
significantly. Am I missing something?

Morten

Preeti U Murthy

Mar 27, 2015, 12:30:06 PM
Hi Wanpeng, Jason,

On 03/27/2015 10:37 AM, Jason Low wrote:
> On Fri, 2015-03-27 at 10:12 +0800, Wanpeng Li wrote:
>> Hi Preeti,
>> On Thu, Mar 26, 2015 at 06:32:44PM +0530, Preeti U Murthy wrote:
>>>
>>> 1. An ILB CPU was chosen from the first numa domain to trigger nohz idle
>>> load balancing [Given the experiment, upto 6 CPUs per core could be
>>> potentially idle in this domain.]
>>>
>>> 2. However the ILB CPU would call load_balance() on itself before
>>> initiating nohz idle load balancing.
>>>
>>> 3. Given cores are SMT8, the ILB CPU had enough opportunities to pull
>>> tasks from its sibling cores to even out load.
>>>
>>> 4. Now that the ILB CPU was no longer idle, it would abort nohz idle
>>> load balancing
>>
>> I don't see abort nohz idle load balancing when ILB CPU was no longer idle
>> in nohz_idle_balance(), could you explain more in details?
>
> Hi Wanpeng,
>
> In nohz_idle_balance(), there is a check for need_resched() so if the
> cpu has something to run, it should exit nohz_idle_balance(), which may
> cause it to not do the idle balancing on the other CPUs.

You are right, it is need_resched() causing the issue. I am sorry I
overlooked that 'idle' was not a pointer. Thanks a lot for pointing out
the issue :)

But the patch still does good, as you will agree, and the changelog makes
no specific mention of the 'idle' parameter but describes the issue in
general. Hence, we are good.

Regards
Preeti U Murthy

Preeti U Murthy

Mar 27, 2015, 12:50:06 PM
Hi Morten,
It's true that we wake up all idle CPUs. But I think we are justified in
doing so, given that nohz_idle_balance() was deemed necessary. The logic
behind nohz_idle_balance() as I see it is that all idle CPUs should be
brought into action when some scheduling group is found to be busy. With
the current code that does not happen if one of them happens to pull
tasks. This logic does not make sense to me.

With the current code, I think it is hard to estimate how many idle CPU
wakeups would be sufficient to balance out the system load. But I
certainly feel that waking up all of them to perform the load balancing
that was asked for, is far better than waking up none of them. This is
in favor of performance. I agree the extra wakeups would do us harm with
power savings, but I would still be fine with it, given the undesirable
scenario that occurs as a consequence, as described in the changelog.

Regards
Preeti U Murthy

Morten Rasmussen

Mar 27, 2015, 2:00:07 PM
I agree that the current behaviour is undesirable and should be fixed,
but IMHO waking up all idle cpus can not be justified. It is only one
additional cpu though with your patch so it isn't quite that bad.

I agree that it is hard to predict how many additional cpus you need,
but I don't think you necessarily need that information as long as you
start by filling up the cpu that was kicked to do the
nohz_idle_balance() first.

You would also solve your problem if you removed the ability for the cpu
to bail out after balancing itself and force it to finish the job. It
would mean harming tasks that were pulled to the balancing cpu as they
would have to wait to be scheduled until the nohz_idle_balance() has
completed. It could be a price worth paying.

An alternative could be to let the balancing cpu balance itself first
and bail out as it currently does, but let it kick the next nohz_idle
cpu to continue the job if it thinks there is more work to be done. So
you would get a chain of kicks that would stop when there is nothing
more to be done. It isn't quite as fast as your solution as it would
require an IPI plus wakeup for each cpu to continue the work. But it
should be much faster than the current code I think.

IMHO it makes more sense to stay with the current scheme of ensuring
that the kicked cpu is actually used before waking up more cpus and
instead improve how additional cpus are kicked if they are needed.

Reducing unnecessary wakeups is quite important for energy consumption
and something a lot of effort is put into. You really don't want to wake
up another cluster/package unnecessarily just because there was only one
nohz-idle cpu left in the previous one which could have handled the
additional load. It gets even worse if the other cluster is less
energy-efficient (big.LITTLE).

Preeti U Murthy

Mar 30, 2015, 3:30:06 AM
Hi Morten,

On 03/27/2015 11:26 PM, Morten Rasmussen wrote:
>
> I agree that the current behaviour is undesirable and should be fixed,
> but IMHO waking up all idle cpus can not be justified. It is only one
> additional cpu though with your patch so it isn't quite that bad.
>
> I agree that it is hard to predict how many additional cpus you need,
> but I don't think you necessarily need that information as long as you
> start by filling up the cpu that was kicked to do the
> nohz_idle_balance() first.
>
> You would also solve your problem if you removed the ability for the cpu
> to bail out after balancing itself and force it to finish the job. It
> would mean harming tasks that where pulled to the balancing cpu as they
> would have to wait being scheduling until the nohz_idle_balance() has
> completed. It could be a price worth paying.

But how would this prevent waking up idle CPUs? You still end up waking
up all idle CPUs, wouldn't you?

>
> An alternative could be to let the balancing cpu balance itself first
> and bail out as it currently does, but let it kick the next nohz_idle
> cpu to continue the job if it thinks there is more work to be done. So
> you would get a chain of kicks that would stop when there nothing
> more to do be done. It isn't quite as fast as your solution as it would

I am afraid there is more to this. If a given CPU is unable to pull
tasks, it could mean that it is an unworthy destination CPU. But it does
not mean that the other idle CPUs are unworthy of balancing too.

So if the ILB CPU stops waking up idle CPUs when it has nothing to pull,
we will end up hurting load balancing. Take for example the scenario
described in the changelog. The idle CPUs within a numa node may find
the load balanced among themselves and hence refrain from pulling any load.
If these ILB CPUs stop nohz idle load balancing at this point, the load
will never get spread across nodes.

If, on the other hand, we keep kicking idle CPUs to carry on idle load
balancing, the wakeup scenario will be no better than it is with this patch.

> require an IPI plus wakeup for each cpu to continue the work. But it
> should be much faster than the current code I think.
>
> IMHO it makes more sense to stay with the current scheme of ensuring
> that the kicked cpu is actually used before waking up more cpus and
> instead improve how additional cpus are kicked if they are needed.

It looks more sensible to do this in parallel. The scenario on POWER is
that tasks don't spread out across nodes until 10s after fork. This is
unforgivable and we cannot afford the code to be the way it is today.

Regards
Preeti U Murthy

Peter Zijlstra

Mar 30, 2015, 7:10:06 AM
On Fri, Mar 27, 2015 at 05:56:51PM +0000, Morten Rasmussen wrote:

> I agree that it is hard to predict how many additional cpus you need,
> but I don't think you necessarily need that information as long as you
> start by filling up the cpu that was kicked to do the
> nohz_idle_balance() first.

> Reducing unnecessary wakeups is quite important for energy consumption
> and something a lot of effort is put into. You really don't want to wake
> up another cluster/package unnecessarily just because there was only one
> nohz-idle cpu left in the previous one which could have handled the
> additional load. It gets even worse if the other cluster is less
> energy-efficient (big.LITTLE).

So the only way to get tasks to cross your cluster is by balancing that
domain. At this point we'll compute sg stats for either group
(=cluster).

The only thing we need to ensure is that it doesn't view the small
cluster as overloaded (as long as it really isn't of course), as long as
its not viewed as overloaded it will not pull tasks from it into the big
cluster, no matter how many ILBs we run before the ILB duty cpu's
rebalance_domains() call.

I'm really not seeing the problem here.

Morten Rasmussen

Mar 30, 2015, 7:30:07 AM
On Mon, Mar 30, 2015 at 08:26:19AM +0100, Preeti U Murthy wrote:
> Hi Morten,
>
> On 03/27/2015 11:26 PM, Morten Rasmussen wrote:
> >
> > I agree that the current behaviour is undesirable and should be fixed,
> > but IMHO waking up all idle cpus can not be justified. It is only one
> > additional cpu though with your patch so it isn't quite that bad.
> >
> > I agree that it is hard to predict how many additional cpus you need,
> > but I don't think you necessarily need that information as long as you
> > start by filling up the cpu that was kicked to do the
> > nohz_idle_balance() first.
> >
> > You would also solve your problem if you removed the ability for the cpu
> > to bail out after balancing itself and force it to finish the job. It
> > would mean harming tasks that where pulled to the balancing cpu as they
> > would have to wait being scheduling until the nohz_idle_balance() has
> > completed. It could be a price worth paying.
>
> But how would this prevent waking up idle CPUs ? You still end up waking
> up all idle CPUs, wouldn't you?

That depends on the scenario. In your example from the changelog you
would. You have enough work for all the nohz-idle cpus and you will keep
iterating through all of them and pull work on their behalf and hence
wake them up. But in a scenario where you don't have enough work for all
nohz-idle cpus you are guaranteed that the balancer cpu has taken its
share and doesn't go back to sleep immediately after finishing
nohz_idle_balance(). So all cpus woken up will have a task to run
including the balancer cpu.

>
> >
> > An alternative could be to let the balancing cpu balance itself first
> > and bail out as it currently does, but let it kick the next nohz_idle
> > cpu to continue the job if it thinks there is more work to be done. So
> > you would get a chain of kicks that would stop when there nothing
> > more to do be done. It isn't quite as fast as your solution as it would
>
> I am afraid there is more to this. If a given CPU is unable to pull
> tasks, it could mean that it is an unworthy destination CPU. But it does
> not mean that the other idle CPUs are unworthy of balancing too.
>
> So if the ILB CPU stops waking up idle CPUs when it has nothing to pull,
> we will end up hurting load balancing. Take for example the scenario
> described in the changelog. The idle CPUs within a numa node may find
> load balanced within themselves and hence refrain from pulling any load.
> If these ILB CPUs stop nohz idle load balancing at this point, the load
> will never get spread across nodes.
>
> If on the other hand, if we keep kicking idle CPUs to carry on idle load
> balancing, the wakeup scenario will be no better than it is with this patch.

By more work to be done I didn't mean stopping when the balancer cpu
gives up, I meant stopping the kick chain when there is nothing more to be
balanced/pulled (or reasonably close). For example use something like
the nohz_kick_needed() checks on the source cpu/group and stopping if
all cpus only have one runnable task. At least try to stop waking an
extra cpu when there is clearly no point in doing so.

> > require an IPI plus wakeup for each cpu to continue the work. But it
> > should be much faster than the current code I think.
> >
> > IMHO it makes more sense to stay with the current scheme of ensuring
> > that the kicked cpu is actually used before waking up more cpus and
> > instead improve how additional cpus are kicked if they are needed.
>
> It looks more sensible to do this in parallel. The scenario on POWER is
> that tasks don't spread out across nodes until 10s of fork. This is
> unforgivable and we cannot afford the code to be the way it is today.

You propose having multiple balancing cpus running in parallel?

I fully agree that the nohz-balancing behaviour should be improved. It
could use a few changes to improve energy-awareness as well. IMHO,
taking a double hit every time we need to wake up an additional cpu is
inefficient and goes against all the effort put into reducing wake-ups,
which is essential for saving energy.

One thing that would help reducing energy consumption and that vendors
carry out-of-tree is improving find_new_ilb() to pick the cheapest cpu
to be kicked for nohz-idle balancing. However, this improvement is
pointless if we are going to wake up an additional cpu to receive the
task(s) that needs to be balanced.

I think a solution where we at least try to avoid waking up additional
cpus and vastly improves your 10s latency is possible.

Morten

Morten Rasmussen

Mar 30, 2015, 8:10:06 AM
On Mon, Mar 30, 2015 at 12:06:32PM +0100, Peter Zijlstra wrote:
> On Fri, Mar 27, 2015 at 05:56:51PM +0000, Morten Rasmussen wrote:
>
> > I agree that it is hard to predict how many additional cpus you need,
> > but I don't think you necessarily need that information as long as you
> > start by filling up the cpu that was kicked to do the
> > nohz_idle_balance() first.
>
> > Reducing unnecessary wakeups is quite important for energy consumption
> > and something a lot of effort is put into. You really don't want to wake
> > up another cluster/package unnecessarily just because there was only one
> > nohz-idle cpu left in the previous one which could have handled the
> > additional load. It gets even worse if the other cluster is less
> > energy-efficient (big.LITTLE).
>
> So the only way to get tasks to cross your cluster is by balancing that
> domain. At this point we'll compute sg stats for either group
> (=cluster).
>
> The only thing we need to ensure is that it doesn't view the small
> cluster as overloaded (as long as it really isn't of course), as long as
> its not viewed as overloaded it will not pull tasks from it into the big
> cluster, no matter how many ILBs we run before the ILB duty cpu's
> rebalance_domains() call.
>
> I'm really not seeing the problem here.

I see. The group_classify() should take care of it in all cases of
balancing across clusters. You would be iterating over all cpus in the
other cluster running rebalance_domains() if the balancer cpu happens to
be the last one in the little cluster though. However, within the
cluster (in case you have 2 or more nohz-idle cpus) you still take a
double hit. No?

Peter Zijlstra

Mar 30, 2015, 8:30:08 AM
It can yes, but typically not I think. This all could use some 'help'
for sure.

So the thing is, find_new_ilb() simply selects the first cpu in
idle_cpus_mask, while at the same time, nohz_idle_balance() will iterate
the idle_cpus_mask with that first cpu being first (obviously).

So it is very likely that if we migrate on the ILB it is in fact to the
current CPU.

In case we cannot, we have no choice but to wake up a second idle,
nothing really to be done about that.

To put it another way, for ILB purposes the rebalance_domains() call is
mostly superfluous. The only other case is if the selected ILB target
became non-idle between being selected and getting to run the softirq
handler. At which point we should wake another anyhow too.

Maybe something like the below helps -- albeit it could use a comment
too of course.

---
kernel/sched/fair.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fdae26eb7218..b879d4b3b599 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7624,11 +7624,12 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
* In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
* rebalancing for all the cpus for whom scheduler ticks are stopped.
*/
-static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
+static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
{
int this_cpu = this_rq->cpu;
- struct rq *rq;
int balance_cpu;
+ struct rq *rq;
+ bool done = false;

if (idle != CPU_IDLE ||
!test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
@@ -7647,6 +7648,8 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
break;

rq = cpu_rq(balance_cpu);
+ if (rq == this_rq)
+ done = true;

/*
* If time for next balance is due,
@@ -7666,6 +7669,8 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
nohz.next_balance = this_rq->next_balance;
end:
clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
+
+ return done;
}

/*
@@ -7744,7 +7749,7 @@ static inline bool nohz_kick_needed(struct rq *rq)
return kick;
}
#else
-static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
+static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { return false; }
#endif

/*
@@ -7765,8 +7770,8 @@ static void run_rebalance_domains(struct softirq_action *h)
* load balance only within the local sched_domain hierarchy
* and abort nohz_idle_balance altogether if we pull some load.
*/
- nohz_idle_balance(this_rq, idle);
- rebalance_domains(this_rq, idle);
+ if (!nohz_idle_balance(this_rq, idle))
+ rebalance_domains(this_rq, idle);
}

/*

Peter Zijlstra

Mar 30, 2015, 9:00:07 AM
On Mon, Mar 30, 2015 at 02:24:49PM +0200, Peter Zijlstra wrote:

> @@ -7647,6 +7648,8 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> break;
>
> rq = cpu_rq(balance_cpu);
> + if (rq == this_rq)
> + done = true;
>
> /*
> * If time for next balance is due,

Equally note that if you change find_new_ilb() to not pick the very
first cpu in the mask you already get the bouncy bounce, because this
routine will in fact try and move things to the first cpu first.

So if you go change find_new_ilb() you should also amend the iteration
order here.

Vincent Guittot

Mar 30, 2015, 9:30:06 AM
AFAICT, this can't happen because we start the for_each_cpu loop with:
if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
continue;

>
> /*
> * If time for next balance is due,
> @@ -7666,6 +7669,8 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> nohz.next_balance = this_rq->next_balance;
> end:
> clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
> +
> + return done;
> }
>
> /*
> @@ -7744,7 +7749,7 @@ static inline bool nohz_kick_needed(struct rq *rq)
> return kick;
> }
> #else
> -static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
> +static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { return false; }
> #endif
>
> /*
> @@ -7765,8 +7770,8 @@ static void run_rebalance_domains(struct softirq_action *h)
> * load balance only within the local sched_domain hierarchy
> * and abort nohz_idle_balance altogether if we pull some load.
> */
> - nohz_idle_balance(this_rq, idle);
> - rebalance_domains(this_rq, idle);
> + if (!nohz_idle_balance(this_rq, idle))
> + rebalance_domains(this_rq, idle);

nohz_idle_balance() runs rebalance_domains() for all CPUs except this CPU

Vincent Guittot

Mar 30, 2015, 9:50:05 AM
On 26 March 2015 at 14:02, Preeti U Murthy <pre...@linux.vnet.ibm.com> wrote:
> When a CPU is kicked to do nohz idle balancing, it wakes up to do load
> balancing on itself, followed by load balancing on behalf of idle CPUs.
> But it may end up with load after the load balancing attempt on itself.
> This aborts nohz idle balancing. As a result several idle CPUs are left
> without tasks till such a time that an ILB CPU finds it unfavorable to
> pull tasks upon itself. This delays spreading of load across idle CPUs
> and worse, clutters only a few CPUs with tasks.
>
> The effect of the above problem was observed on an SMT8 POWER server
> with 2 levels of numa domains. Busy loops equal to number of cores were
> spawned. Since load balancing on fork/exec is discouraged across numa
> domains, all busy loops would start on one of the numa domains. However
> it was expected that eventually one busy loop would run per core across
> all domains due to nohz idle load balancing. But it was observed that it
> took as long as 10 seconds to spread the load across numa domains.

10sec is quite long. Have you checked how many load balances are needed
to spread the load on the system? Are you using the default
min_interval and max_interval?
The default period range is [sd_weight : 2*sd_weight]. I don't know how
many CPUs you have but, as an example, a system made of 128 CPUs will
do a load balance across the whole system every 128 jiffies if the CPU
is idle and every 4096 jiffies on a busy CPU. This could explain why
you need so much time to spread tasks across the system.

Vincent

Peter Zijlstra

Mar 30, 2015, 10:10:08 AM
On Mon, Mar 30, 2015 at 03:29:09PM +0200, Vincent Guittot wrote:
> On 30 March 2015 at 14:24, Peter Zijlstra <pet...@infradead.org> wrote:
> > @@ -7647,6 +7648,8 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> > break;
> >
> > rq = cpu_rq(balance_cpu);
> > + if (rq == this_rq)
> > + done = true;
>
> AFAICT, this can't happen because we start the for_each _cpu loop with:
> if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
> continue;

Oh hey, look at me being blind ;-)

Morten Rasmussen

Mar 30, 2015, 11:30:09 AM
Apart from the issues that Vincent has already pointed out, including
the balancing of this_cpu in nohz_idle_balance() also means that we are
no longer ignoring rq->next_balance for the cpu being kicked. So we risk
kicking a cpu to do nohz_idle_balance which might ignore the fact that
we need an ILB (the kicker should have determined that already) due to
having done a load-balance recently.

Also, if we include this_cpu in nohz_idle_balance() and pick the first
nohz-idle cpu (as we currently do) we are back to where we were before
Preeti's patch. The balancer cpu will bail out if it pulls anything for
itself and not kick anybody to take over.

Jason Low

Mar 30, 2015, 3:00:06 PM
Hi Preeti,

I noticed that another commit 4a725627f21d converted the check in
nohz_kick_needed() from idle_cpu() to rq->idle_balance. If this cpu is
able to pull tasks using rebalance_domains(), a potentially outdated
value would be used and nohz_kick_needed() would directly return false.

Would this patch also help address some of the issue you are seeing?

Signed-off-by: Jason Low <jason...@hp.com>
---
kernel/sched/fair.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fdae26e..ba8ec1a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7644,7 +7644,7 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
* balancing owner will pick it up.
*/
if (need_resched())
- break;
+ goto end;

rq = cpu_rq(balance_cpu);

@@ -7687,7 +7687,7 @@ static inline bool nohz_kick_needed(struct rq *rq)
int nr_busy, cpu = rq->cpu;
bool kick = false;

- if (unlikely(rq->idle_balance))
+ if (unlikely(idle_cpu(cpu)))
return false;

/*
--
1.7.2.5

Preeti U Murthy

Mar 31, 2015, 4:10:05 AM
On 03/30/2015 07:15 PM, Vincent Guittot wrote:
> On 26 March 2015 at 14:02, Preeti U Murthy <pre...@linux.vnet.ibm.com> wrote:
>> When a CPU is kicked to do nohz idle balancing, it wakes up to do load
>> balancing on itself, followed by load balancing on behalf of idle CPUs.
>> But it may end up with load after the load balancing attempt on itself.
>> This aborts nohz idle balancing. As a result several idle CPUs are left
>> without tasks till such a time that an ILB CPU finds it unfavorable to
>> pull tasks upon itself. This delays spreading of load across idle CPUs
>> and worse, clutters only a few CPUs with tasks.
>>
>> The effect of the above problem was observed on an SMT8 POWER server
>> with 2 levels of numa domains. Busy loops equal to number of cores were
>> spawned. Since load balancing on fork/exec is discouraged across numa
>> domains, all busy loops would start on one of the numa domains. However
>> it was expected that eventually one busy loop would run per core across
>> all domains due to nohz idle load balancing. But it was observed that it
>> took as long as 10 seconds to spread the load across numa domains.
>
> 10sec is quite long. Have you checked how many load balance is needed
> to spread the load on the system ? Are you using the default

The issue was that load balancing was not even being *attempted* due to
the above mentioned reason. The ILB CPU would pull load and abort
nohz_idle_balance(), which was the only way load balancing could be
triggered on the idle CPUs. So it would take a long time for load
balancing to be called on the idle CPUs, after which it was quick to
spread the load. There was, after all, a stark imbalance between the
load across the nodes.

> min_interval and max_interval ?
> The default period range is [sd_weight : 2*sd_weight] I don't know how
> many CPUs you have but as an example, a system made of 128 CPUs will
> do a load balance across the wide system each 128 jifffies if the CPU
> is idle and each 4096 jiffies on a busy CPU. This could explain why
> you need so much time to spread task across the system.

Yes I did suspect this in the beginning but I could reproduce the
problem even on a machine with few cores.

Regards
Preeti U Murthy

Preeti U Murthy

Mar 31, 2015, 4:40:08 AM
Hi Jason,

On 03/31/2015 12:25 AM, Jason Low wrote:
> Hi Preeti,
>
> I noticed that another commit 4a725627f21d converted the check in
> nohz_kick_needed() from idle_cpu() to rq->idle_balance, causing a
> potentially outdated value to be used if this cpu is able to pull tasks
> using rebalance_domains(), and nohz_kick_needed() directly returning
> false.

I see that rebalance_domains() will be run at the end of the scheduler
tick interrupt handling. trigger_load_balance() only sets the softirq,
it does not call rebalance_domains() immediately. So the call graph
would be:

rq->idle_balance = idle_cpu()
|____trigger_load_balance()
|_____raise SCHED_SOFTIRQ - we are handling interrupt, hence defer
|____nohz_kick_needed()
|____rebalance_domains() run through the softirqd.

Correct me if I am wrong but since we do not pull any load between the
rq->idle_balance update and nohz_kick_needed(), we are safe in reading
rq->idle_balance in nohz_kick_needed().

>
> Would this patch also help address some of the issue you are seeing?
>
> Signed-off-by: Jason Low <jason...@hp.com>
> ---
> kernel/sched/fair.c | 4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fdae26e..ba8ec1a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7644,7 +7644,7 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> * balancing owner will pick it up.
> */
> if (need_resched())
> - break;
> + goto end;

Why is this hunk needed?

Regards
Preeti U Murthy
>
> rq = cpu_rq(balance_cpu);
>
> @@ -7687,7 +7687,7 @@ static inline bool nohz_kick_needed(struct rq *rq)
> int nr_busy, cpu = rq->cpu;
> bool kick = false;
>
> - if (unlikely(rq->idle_balance))
> + if (unlikely(idle_cpu(cpu)))
> return false;
>
> /*
>

--

Preeti U Murthy

Mar 31, 2015, 5:00:07 AM
Morten,

I am a bit confused about the problem you are pointing to.
nohz_idle_balance() does not kick the idle CPUs into action unless there
is work to be done. So there are no redundant wakeups. Hence I see no
problem here.

The ILB CPU is woken up to do the nohz idle balancing, but with this
patch, may end up with no work for itself at the end of
nohz_idle_balance() and return to sleep. That is one wakeup for merely
doing idle load balancing, but this wakeup is needed and worthwhile for
spreading the load. So I find no problem here too.

I am unable to see the issue. What is it that I am missing ?

Regards
Preeti U Murthy

Jason Low

Mar 31, 2015, 1:40:06 PM
On Tue, 2015-03-31 at 14:28 +0530, Preeti U Murthy wrote:

> Morten,

> I am a bit confused about the problem you are pointing to.

> I am unable to see the issue. What is it that I am missing ?

Hi Preeti,

Here is one of the potential issues that have been described from my
understanding.

In situations where there are just a few tasks to pull (for example,
there's 1 task to move).

Before, if CPU 1 calls run_rebalance_domains(), we'll pull the tasks to
this CPU 1 (which is already awake) and run the task on CPU 1.

Now, we'll pull the task to some idle CPU 2 and wake up CPU 2 in order
for the task to run. Meanwhile, CPU 1 may go idle, instead of running
the task on CPU 1 which was already awake.

Jason Low

Mar 31, 2015, 3:00:07 PM
On Tue, 2015-03-31 at 14:07 +0530, Preeti U Murthy wrote:
> Hi Jason,
>
> On 03/31/2015 12:25 AM, Jason Low wrote:
> > Hi Preeti,
> >
> > I noticed that another commit 4a725627f21d converted the check in
> > nohz_kick_needed() from idle_cpu() to rq->idle_balance, causing a
> > potentially outdated value to be used if this cpu is able to pull tasks
> > using rebalance_domains(), and nohz_kick_needed() directly returning
> > false.
>
> I see that rebalance_domains() will be run at the end of the scheduler
> tick interrupt handling. trigger_load_balance() only sets the softirq,
> it does not call rebalance_domains() immediately. So the call graph
> would be:

Oh right, since that only sets the softirq, this wouldn't be the issue,
though we would need these changes if we were to incorporate any sort of
nohz_kick_needed() logic into the nohz_idle_balance() code path, correct?

Preeti U Murthy

Apr 1, 2015, 2:30:05 AM
On 03/31/2015 11:00 PM, Jason Low wrote:
> On Tue, 2015-03-31 at 14:28 +0530, Preeti U Murthy wrote:
>
>> Morten,
>
>> I am a bit confused about the problem you are pointing to.
>
>> I am unable to see the issue. What is it that I am missing ?
>
> Hi Preeti,
>
> Here is one of the potential issues that have been described from my
> understanding.
>
> In situations where there are just a few tasks to pull (for example,
> there's 1 task to move).
>
> Before, if CPU 1 calls run_rebalance_domains(), we'll pull the tasks to
> this CPU 1 (which is already awake) and run the task on CPU 1.
>
> Now, we'll pull the task to some idle CPU 2 and wake up CPU 2 in order
> for the task to run. Meanwhile, CPU 1 may go idle, instead of running
> the task on CPU 1 which was already awake.
>

Alright I see. But it is one additional wake up. And the wake up will be
within the cluster. We will not wake up any CPU in the neighboring
cluster unless there are tasks to be pulled. So in the problem described
we may wake up a core out of a deep idle state, but never a whole
cluster. In terms of energy efficiency, this is not so bad a scenario, is it?

Regards
Preeti U Murthy

Preeti U Murthy

Apr 1, 2015, 3:00:07 AM

On 04/01/2015 12:24 AM, Jason Low wrote:
> On Tue, 2015-03-31 at 14:07 +0530, Preeti U Murthy wrote:
>> Hi Jason,
>>
>> On 03/31/2015 12:25 AM, Jason Low wrote:
>>> Hi Preeti,
>>>
>>> I noticed that another commit 4a725627f21d converted the check in
>>> nohz_kick_needed() from idle_cpu() to rq->idle_balance, causing a
>>> potentially outdated value to be used if this cpu is able to pull tasks
>>> using rebalance_domains(), and nohz_kick_needed() directly returning
>>> false.
>>
>> I see that rebalance_domains() will be run at the end of the scheduler
>> tick interrupt handling. trigger_load_balance() only sets the softirq,
>> it does not call rebalance_domains() immediately. So the call graph
>> would be:
>
> Oh right, since that only sets the softirq, this wouldn't be the issue,
> though we would need these changes if we were to incorporate any sort of
> nohz_kick_needed() logic into the nohz_idle_balance() code path correct?

I am sorry I don't quite get this. Can you please elaborate?

Regards
Preeti U Murthy

Morten Rasmussen

Apr 1, 2015, 9:10:07 AM
Hi Preeti and Jason,

On Wed, Apr 01, 2015 at 07:28:03AM +0100, Preeti U Murthy wrote:
> On 03/31/2015 11:00 PM, Jason Low wrote:
> > On Tue, 2015-03-31 at 14:28 +0530, Preeti U Murthy wrote:
> >
> >> Morten,
> >
> >> I am a bit confused about the problem you are pointing to.
> >
> >> I am unable to see the issue. What is it that I am missing ?
> >
> > Hi Preeti,
> >
> > Here is one of the potential issues that have been described from my
> > understanding.
> >
> > In situations where there are just a few tasks to pull (for example,
> > there's 1 task to move).
> >
> > Before, if CPU 1 calls run_rebalance_domains(), we'll pull the tasks to
> > this CPU 1 (which is already awake) and run the task on CPU 1.
> >
> > Now, we'll pull the task to some idle CPU 2 and wake up CPU 2 in order
> > for the task to run. Meanwhile, CPU 1 may go idle, instead of running
> > the task on CPU 1 which was already awake.

Yes. This is the scenario I had in mind although I might have failed to
make it crystal clear in my earlier replies.

> Alright I see. But it is one additional wake up. And the wake up will be
> within the cluster. We will not wake up any CPU in the neighboring
> cluster unless there are tasks to be pulled. So, we can wake up a core
> out of a deep idle state and never a cluster in the problem described.
> In terms of energy efficiency, this is not so bad a scenario, is it?

After Peter pointed out that it shouldn't happen across clusters due to
group_classify()/sg_capacity_factor() it isn't as bad as I initially
thought. It is still not an ideal solution I think. Wake-ups aren't nice
for battery-powered devices. Waking up a cpu in an already active
cluster may still imply powering up the core and bringing the L1 cache
into a usable state, but it isn't as bad as waking up a cluster. I would
prefer to avoid it if we can.

Thinking more about it, don't we also risk doing a lot of iterations in
nohz_idle_balance() leading to nothing (pure overhead) in certain corner
cases? If the cpu picked by find_new_ilb() is the last cpu in the
cluster and we have one task for each cpu in the cluster but one cpu
currently has two, don't we end up trying all nohz-idle cpus before
giving up and balancing the balancer cpu itself? On big machines, going
through everyone could take a while I think. No?

Morten

Morten Rasmussen

Apr 1, 2015, 1:10:06 PM
On Wed, Apr 01, 2015 at 07:49:56AM +0100, Preeti U Murthy wrote:
>
> On 04/01/2015 12:24 AM, Jason Low wrote:
> > On Tue, 2015-03-31 at 14:07 +0530, Preeti U Murthy wrote:
> >> Hi Jason,
> >>
> >> On 03/31/2015 12:25 AM, Jason Low wrote:
> >>> Hi Preeti,
> >>>
> >>> I noticed that another commit 4a725627f21d converted the check in
> >>> nohz_kick_needed() from idle_cpu() to rq->idle_balance, causing a
> >>> potentially outdated value to be used if this cpu is able to pull tasks
> >>> using rebalance_domains(), and nohz_kick_needed() directly returning
> >>> false.
> >>
> >> I see that rebalance_domains() will be run at the end of the scheduler
> >> tick interrupt handling. trigger_load_balance() only sets the softirq,
> >> it does not call rebalance_domains() immediately. So the call graph
> >> would be:
> >
> > Oh right, since that only sets the softirq, this wouldn't be the issue,
> > though we would need these changes if we were to incorporate any sort of
> > nohz_kick_needed() logic into the nohz_idle_balance() code path correct?
>
> I am sorry I don't quite get this. Can you please elaborate?

I think the scenario is that we are in nohz_idle_balance() and decide to
bail out because we have pulled some tasks, but before leaving
nohz_idle_balance() we want to check if more balancing is necessary
using nohz_kick_needed() and potentially kick somebody to continue.

Note that the balance cpu is currently skipped in nohz_idle_balance(),
but if it wasn't the scenario would be possible.

In that case, we can't rely on rq->idle_balance as it would not be
up-to-date. Also, we may even want to use nohz_kick_needed(rq) where rq
!= this_rq, in which case we probably also want an updated status. It
seems that rq->idle_balance is only updated at each tick.

Or maybe I'm all wrong :)

Jason Low

Apr 1, 2015, 9:00:06 PM
On Wed, 2015-04-01 at 14:03 +0100, Morten Rasmussen wrote:

Hi Morten,

> > Alright I see. But it is one additional wake up. And the wake up will be
> > within the cluster. We will not wake up any CPU in the neighboring
> > cluster unless there are tasks to be pulled. So, we can wake up a core
> > out of a deep idle state and never a cluster in the problem described.
> > In terms of energy efficiency, this is not so bad a scenario, is it?
>
> After Peter pointed out that it shouldn't happen across clusters due to
> group_classify()/sg_capacity_factor() it isn't as bad as I initially
> thought. It is still not an ideal solution I think. Wake-ups aren't nice
> for battery-powered devices. Waking up a cpu in an already active
> cluster may still imply powering up the core and bringing the L1 cache
> into a usable state, but it isn't as bad as waking up a cluster. I would
> prefer to avoid it if we can.

Right. I still think that the patch is justified if it addresses the 10
second latency issue, but if we could find a better solution, that would
be great :)

> Thinking more about it, don't we also risk doing a lot of iterations in
> nohz_idle_balance() leading to nothing (pure overhead) in certain corner
> cases? If find_new_ild() is the last cpu in the cluster and we have one
> task for each cpu in the cluster but one cpu is currently having two.
> Don't we end up trying all nohz-idle cpus before giving up and balancing
> the balancer cpu itself. On big machines, going through everyone could
> take a while I think. No?

Iterating through many CPUs could take a while, but since we only do
nohz_idle_balance() when the CPU is idle and exit if need_resched, then
we're only doing so if there is nothing else that needs to run.

Also, we're only attempting balancing when time_after_eq
rq->next_balance, so much of the time, we don't actually traverse all
the CPUs.

So this may not be too big of an issue.

Jason Low

Apr 1, 2015, 10:20:06 PM
On Tue, 2015-03-31 at 14:07 +0530, Preeti U Murthy wrote:
> On 03/31/2015 12:25 AM, Jason Low wrote:

> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index fdae26e..ba8ec1a 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7644,7 +7644,7 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> > * balancing owner will pick it up.
> > */
> > if (need_resched())
> > - break;
> > + goto end;
>
> Why is this hunk needed?

In terms of the change in the need_resched() case, if the current CPU
doesn't complete iterating all of the CPUs, then this will make it not
update nohz.next_balance. This is so we can continue the balancing with
the next balancing owner without too much delay.

Preeti U Murthy

Apr 1, 2015, 11:30:08 PM
Hi Morten,

On 04/01/2015 06:33 PM, Morten Rasmussen wrote:
>> Alright I see. But it is one additional wake up. And the wake up will be
>> within the cluster. We will not wake up any CPU in the neighboring
>> cluster unless there are tasks to be pulled. So, we can wake up a core
>> out of a deep idle state and never a cluster in the problem described.
>> In terms of energy efficiency, this is not so bad a scenario, is it?
>
> After Peter pointed out that it shouldn't happen across clusters due to
> group_classify()/sg_capacity_factor() it isn't as bad as I initially
> thought. It is still not an ideal solution I think. Wake-ups aren't nice
> for battery-powered devices. Waking up a cpu in an already active
> cluster may still imply powering up the core and bringing the L1 cache
> into a usable state, but it isn't as bad as waking up a cluster. I would
> prefer to avoid it if we can.
>
> Thinking more about it, don't we also risk doing a lot of iterations in
> nohz_idle_balance() leading to nothing (pure overhead) in certain corner
> cases? If find_new_ild() is the last cpu in the cluster and we have one
> task for each cpu in the cluster but one cpu is currently having two.
> Don't we end up trying all nohz-idle cpus before giving up and balancing
> the balancer cpu itself. On big machines, going through everyone could
> take a while I think. No?

The balancer CPU will iterate as long as need_resched() is not set on
it. It may take a while, but if the balancer CPU has nothing else to do,
the iteration will not come at a cost to performance.

Besides this, load balancing itself is optimized in terms of who does it
and how often. The candidate CPUs for load balancing are the first idle
CPUs in a given group. So nohz idle load balancing may abort on some of
the idle CPUs. If the CPUs on our left have not managed to pull tasks,
we abort load balancing too. We will save time here.

Load balancing on bigger sched domains is spaced out in time too. The
min_interval is equal to sd_weight and the balance_interval can be as
large as 2*sd_weight. This should ensure that load balancing across
large scheduling domains is not carried out too often. nohz idle load
balancing may therefore not go through the entire scheduling domain
hierarchy for each CPU. This will cut down on the time too.

Regards
Preeti U Murthy

Jason Low

Apr 1, 2015, 11:40:09 PM
This scenario would also be possible if we went back to calling
rebalance_domains() first.

I'm wondering if adding the nohz_kick_needed(), ect... in
nohz_idle_balance() can address the 10 second latency issue while still
calling rebalance_domains() first, since it seems more ideal to try
balancing on the current awake CPU first, as you also have mentioned

> In that case, we can't rely on rq->idle_balance as it would not be
> up-to-date. Also, we may even want to use nohz_kick_needed(rq) where rq
> != this_rq, in which case we probably also want an updated status. It
> seems that rq->idle_balance is only updated at each tick.

Yup, that's about what I was describing.

Jason Low

Apr 2, 2015, 2:10:07 AM
On Wed, 2015-04-01 at 18:04 +0100, Morten Rasmussen wrote:
> On Wed, Apr 01, 2015 at 07:49:56AM +0100, Preeti U Murthy wrote:

> > I am sorry I don't quite get this. Can you please elaborate?
>
> I think the scenario is that we are in nohz_idle_balance() and decide to
> bail out because we have pulled some tasks, but before leaving
> nohz_idle_balance() we want to check if more balancing is necessary
> using nohz_kick_needed() and potentially kick somebody to continue.

Also, below is an example patch.

(Without the conversion to idle_cpu(), the check for rq->idle_balance
would not be accurate anymore)

---
kernel/sched/fair.c | 17 ++++++++++-------
1 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fdae26e..7749a14 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7620,6 +7620,8 @@ out:
}

#ifdef CONFIG_NO_HZ_COMMON
+static inline bool nohz_kick_needed(struct rq *rq);
+
/*
* In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
* rebalancing for all the cpus for whom scheduler ticks are stopped.
@@ -7629,6 +7631,7 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
int this_cpu = this_rq->cpu;
struct rq *rq;
int balance_cpu;
+ bool done_balancing = false;

if (idle != CPU_IDLE ||
!test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
@@ -7644,7 +7647,7 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
* balancing owner will pick it up.
*/
if (need_resched())
- break;
+ goto end;

rq = cpu_rq(balance_cpu);

@@ -7663,9 +7666,12 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
if (time_after(this_rq->next_balance, rq->next_balance))
this_rq->next_balance = rq->next_balance;
}
+ done_balancing = true;
nohz.next_balance = this_rq->next_balance;
end:
clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
+ if (!done_balancing && nohz_kick_needed(this_rq))
+ nohz_balancer_kick();
}

/*
@@ -7687,7 +7693,7 @@ static inline bool nohz_kick_needed(struct rq *rq)
int nr_busy, cpu = rq->cpu;
bool kick = false;

- if (unlikely(rq->idle_balance))
+ if (unlikely(idle_cpu(cpu)))
return false;

/*
@@ -7757,16 +7763,13 @@ static void run_rebalance_domains(struct softirq_action *h)
enum cpu_idle_type idle = this_rq->idle_balance ?
CPU_IDLE : CPU_NOT_IDLE;

+ rebalance_domains(this_rq, idle);
/*
* If this cpu has a pending nohz_balance_kick, then do the
* balancing on behalf of the other idle cpus whose ticks are
- * stopped. Do nohz_idle_balance *before* rebalance_domains to
- * give the idle cpus a chance to load balance. Else we may
- * load balance only within the local sched_domain hierarchy
- * and abort nohz_idle_balance altogether if we pull some load.
+ * stopped.
*/
nohz_idle_balance(this_rq, idle);
- rebalance_domains(this_rq, idle);
}

/*
--
1.7.2.5

Morten Rasmussen

Apr 2, 2015, 4:50:07 AM
Yes.

> I'm wondering if adding the nohz_kick_needed(), ect... in
> nohz_idle_balance() can address the 10 second latency issue while still
> calling rebalance_domains() first, since it seems more ideal to try
> balancing on the current awake CPU first, as you also have mentioned

I believe it could. That is where I was going with the chain-of-kicks
idea. I think the main cause of the unacceptable latency you are
observing is that nohz kicks are only issued at the tick. So if the
balancer pulls for itself first and bails out, the next kick won't be issued until
the next tick or even multiple ticks later depending on
nohz.next_balance.

I haven't figured out if there is a reason for delaying the next
nohz_idle_balance() though.

Preeti U Murthy
Apr 2, 2015, 4:50:07 AM
Ok this patch looks good. Let me test to find out if scheduling behavior
improves.

Regards
Preeti U Murthy


>
> /*

Morten Rasmussen
Apr 2, 2015, 5:20:05 AM
I think this should reduce the latency Preeti is seeing and avoid
unnecessary wake-ups, however, it may not be quite as aggressive in
spreading tasks quickly. It will stop the chain-of-kicks as soon as the
balancer cpu has pulled only one task. The source cpu may still
have two tasks and other cpus may still have more than two tasks
running.

Depending on how bad it is, we could consider kicking another cpu if the
imbalance is still significant after the balancer cpu has pulled a task.

Jason Low
Apr 2, 2015, 1:30:05 PM
On Thu, 2015-04-02 at 10:17 +0100, Morten Rasmussen wrote:
> On Thu, Apr 02, 2015 at 06:59:07AM +0100, Jason Low wrote:

> > Also, below is an example patch.
> >
> > (Without the conversion to idle_cpu(), the check for rq->idle_balance
> > would not be accurate anymore)

> I think this should reduce the latency Preeti is seeing and avoid
> unnecessary wake-ups, however, it may not be quite as aggressive in
> spreading tasks quickly. It will stop the chain-of-kicks as soon as the
> balancer cpu has pulled only one task. The source cpu may still
> have two tasks and other cpus may still have more than two tasks
> running.

Yeah, good point. I'll wait and see if Preeti finds this to improve
scheduling behavior. If this only helps a little though, we can also try
to make it more aggressive in spreading tasks.

Tim Chen
Apr 3, 2015, 6:40:06 PM
I think we can get rid of the done_balancing boolean
and make it a bit easier to read if we change the above code to

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bcfe320..08317dc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7557,8 +7557,13 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
* work being done for other cpus. Next load
* balancing owner will pick it up.
*/
- if (need_resched())
- break;
+ if (need_resched()) {
+ /* preparing to bail, kicking other cpu to continue */
+ clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
+ if (nohz_kick_needed(this_rq))
+ nohz_balancer_kick();
+ return;
+ }

rq = cpu_rq(balance_cpu);

Thanks.

Tim

Preeti U Murthy
Apr 4, 2015, 6:00:06 AM
On 04/02/2015 11:29 AM, Jason Low wrote:
Solution 1: As exists in the mainline
Solution 2: nohz_idle_balance(); rebalance_domains() on the ILB CPU
Solution 3: Above patch.

I observe that Solution 3 is not as aggressive in spreading load as
Solution 2. With Solution 2, the load gets spread within the first 3-4
seconds, while with Solution 3, the load gets spread within the first 6-7
seconds. I think this is because the above patch decides whether to continue
nohz idle load balancing based on the load on the current ILB CPU, which
has most likely pulled just one task. This will abort further load
balancing. However, Solution 3 is certainly better at spreading load
than Solution 1.

Wrt IPIs, I see that Solution 2 results in increase in the number of
IPIs by around 2% over Solution 3, probably for the same reason that
Morten pointed out.

Regards
Preeti U Murthy

Jason Low
Apr 7, 2015, 1:50:07 PM
On Fri, 2015-04-03 at 15:35 -0700, Tim Chen wrote:
> I think we can get rid of the done_balancing boolean
> and make it a bit easier to read if we change the above code to
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bcfe320..08317dc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7557,8 +7557,13 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> * work being done for other cpus. Next load
> * balancing owner will pick it up.
> */
> - if (need_resched())
> - break;
> + if (need_resched()) {
> + /* preparing to bail, kicking other cpu to continue */
> + clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
> + if (nohz_kick_needed(this_rq))
> + nohz_balancer_kick();
> + return;
> + }

Hi Tim,

We would also need the nohz_kick_needed()/nohz_balancer_kick() calls if we
initially find that the current CPU is not idle (at the beginning of
nohz_idle_balance). In the above case, we would need to add the code to
2 locations.

Would it be better to still keep the done_balancing to avoid having
duplicate code?

Tim Chen
Apr 7, 2015, 3:40:05 PM
How about consolidating the code for passing on the
nohz balancing and calling it at both places?
Something like below; it makes the code more readable.

Tim

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 40667cb..16f6904 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7531,6 +7531,15 @@ out:
}

#ifdef CONFIG_NO_HZ_COMMON
+static inline bool nohz_kick_needed(struct rq *rq);
+
+static inline void pass_nohz_balance(struct rq *this_rq, int this_cpu)
+{
+ clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
+ if (nohz_kick_needed(this_rq))
+ nohz_balancer_kick();
+}
+
/*
* In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
* rebalancing for all the cpus for whom scheduler ticks are stopped.
@@ -7542,8 +7551,10 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
int balance_cpu;

if (idle != CPU_IDLE ||
- !test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
- goto end;
+ !test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu))) {
+ pass_nohz_balance(this_rq, this_cpu);
+ return;
+ }

for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
@@ -7554,8 +7565,10 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
* work being done for other cpus. Next load
* balancing owner will pick it up.
*/
- if (need_resched())
- break;
+ if (need_resched()) {
+ pass_nohz_balance(this_rq, this_cpu);
+ return;
+ }

rq = cpu_rq(balance_cpu);

@@ -7575,7 +7588,6 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
this_rq->next_balance = rq->next_balance;
}
nohz.next_balance = this_rq->next_balance;
-end:
clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));

Jason Low
Apr 7, 2015, 4:30:06 PM
Sure, this can make it more readable. This also avoids the need for the
goto, in addition to removing the done_balancing boolean.

Thanks,
Jason

Jason Low
Apr 7, 2015, 7:30:09 PM
On Sat, 2015-04-04 at 15:29 +0530, Preeti U Murthy wrote:

> Solution 1: As exists in the mainline
> Solution 2: nohz_idle_balance(); rebalance_domains() on the ILB CPU
> Solution 3: Above patch.
>
> I observe that Solution 3 is not as aggressive in spreading load as
> Solution 2. With Solution 2, the load gets spread within the first 3-4
> seconds,

hmm, so 3-4 seconds still sounds like a long time.

> while with Solution3, the load gets spread within the first 6-7
> seconds. I think this is because, the above patch decides to further
> nohz_idle_load_balance() based on the load on the current ILB CPU which
> has most likely pulled just one task.

Okay, so perhaps we can also try continuing nohz load balancing if we
find that there are overloaded CPUs in the system.

Jason Low
Apr 7, 2015, 8:10:06 PM
On Tue, 2015-04-07 at 16:28 -0700, Jason Low wrote:

> Okay, so perhaps we can also try continuing nohz load balancing if we
> find that there are overloaded CPUs in the system.

Something like the following.

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fdae26e..d636bf7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7620,6 +7620,16 @@ out:
}

#ifdef CONFIG_NO_HZ_COMMON
+static inline bool nohz_kick_needed(struct rq *rq);
+
+static inline void pass_nohz_balance(struct rq *this_rq, int this_cpu)
+{
+ clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
+ nohz.next_balance = jiffies;
+ if (nohz_kick_needed(this_rq))
+ nohz_balancer_kick();
+}
+
/*
* In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
* rebalancing for all the cpus for whom scheduler ticks are stopped.
@@ -7631,8 +7641,10 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
int balance_cpu;

if (idle != CPU_IDLE ||
- !test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
- goto end;
+ !test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu))) {
+ pass_nohz_balance(this_rq, this_cpu);
+ return;
+ }

for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
@@ -7643,8 +7655,10 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
* work being done for other cpus. Next load
* balancing owner will pick it up.
*/
- if (need_resched())
- break;
+ if (need_resched()) {
+ pass_nohz_balance(this_rq, this_cpu);
+ return;
+ }

rq = cpu_rq(balance_cpu);

@@ -7664,7 +7678,6 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
this_rq->next_balance = rq->next_balance;
}
nohz.next_balance = this_rq->next_balance;
-end:
clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
}

@@ -7687,7 +7700,7 @@ static inline bool nohz_kick_needed(struct rq *rq)
int nr_busy, cpu = rq->cpu;
bool kick = false;

- if (unlikely(rq->idle_balance))
+ if (unlikely(idle_cpu(cpu)))
return false;

/*
@@ -7707,7 +7720,7 @@ static inline bool nohz_kick_needed(struct rq *rq)
if (time_before(now, nohz.next_balance))
return false;

- if (rq->nr_running >= 2)
+ if (rq->nr_running >= 2 || rq->rd->overload)
return true;

rcu_read_lock();
@@ -7757,16 +7770,14 @@ static void run_rebalance_domains(struct softirq_action *h)
enum cpu_idle_type idle = this_rq->idle_balance ?
CPU_IDLE : CPU_NOT_IDLE;

+ rebalance_domains(this_rq, idle);
+
/*
* If this cpu has a pending nohz_balance_kick, then do the
* balancing on behalf of the other idle cpus whose ticks are
- * stopped. Do nohz_idle_balance *before* rebalance_domains to
- * give the idle cpus a chance to load balance. Else we may
- * load balance only within the local sched_domain hierarchy
- * and abort nohz_idle_balance altogether if we pull some load.
+ * stopped.
*/
nohz_idle_balance(this_rq, idle);
- rebalance_domains(this_rq, idle);
}

/*


Srikar Dronamraju
Apr 8, 2015, 7:20:06 AM
* Jason Low <jason...@hp.com> [2015-04-07 17:07:46]:

> On Tue, 2015-04-07 at 16:28 -0700, Jason Low wrote:
>
> > Okay, so perhaps we can also try continuing nohz load balancing if we
> > find that there are overloaded CPUs in the system.
>
> Something like the following.
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fdae26e..d636bf7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7620,6 +7620,16 @@ out:
> }
>
> #ifdef CONFIG_NO_HZ_COMMON
> +static inline bool nohz_kick_needed(struct rq *rq);
> +
> +static inline void pass_nohz_balance(struct rq *this_rq, int this_cpu)
> +{
> + clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
> + nohz.next_balance = jiffies;

Why are we updating nohz.next_balance here?

> + if (nohz_kick_needed(this_rq))
> + nohz_balancer_kick();
> +}
> +
> /*
> * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
> * rebalancing for all the cpus for whom scheduler ticks are stopped.
> @@ -7631,8 +7641,10 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> int balance_cpu;
>
> if (idle != CPU_IDLE ||

Would it make sense to add need_resched here like
http://mid.gmane.org/1427442750-8112-1-git...@linux.intel.com

> - !test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
> - goto end;
> + !test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu))) {
> + pass_nohz_balance(this_rq, this_cpu);
> + return;
> + }
>
> for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
> if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))

<snipped >

> @@ -7687,7 +7700,7 @@ static inline bool nohz_kick_needed(struct rq *rq)
> int nr_busy, cpu = rq->cpu;
> bool kick = false;
>
> - if (unlikely(rq->idle_balance))
> + if (unlikely(idle_cpu(cpu)))
> return false;


The only other place that we use idle_balance is
run_rebalance_domains(). Would it make sense to just use idle_cpu() in
run_rebalance_domains() and remove rq->idle_balance?
Thanks and Regards
Srikar Dronamraju

Jason Low
Apr 8, 2015, 5:30:08 PM
On Wed, 2015-04-08 at 16:42 +0530, Srikar Dronamraju wrote:
> * Jason Low <jason...@hp.com> [2015-04-07 17:07:46]:
>
> > On Tue, 2015-04-07 at 16:28 -0700, Jason Low wrote:
> >
> > > Okay, so perhaps we can also try continuing nohz load balancing if we
> > > find that there are overloaded CPUs in the system.
> >
> > Something like the following.
> >
> > ---
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index fdae26e..d636bf7 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7620,6 +7620,16 @@ out:
> > }
> >
> > #ifdef CONFIG_NO_HZ_COMMON
> > +static inline bool nohz_kick_needed(struct rq *rq);
> > +
> > +static inline void pass_nohz_balance(struct rq *this_rq, int this_cpu)
> > +{
> > + clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
> > + nohz.next_balance = jiffies;
>
> Why are we updating nohz.next_balance here?

This was just to make sure that, since we're continuing the balancing on
another CPU, the nohz next_balance is guaranteed to be "now".

> > + if (nohz_kick_needed(this_rq))
> > + nohz_balancer_kick();
> > +}
> > +
> > /*
> > * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
> > * rebalancing for all the cpus for whom scheduler ticks are stopped.
> > @@ -7631,8 +7641,10 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> > int balance_cpu;
> >
> > if (idle != CPU_IDLE ||
>
> Would it make sense to add need_resched here like
> http://mid.gmane.org/1427442750-8112-1-git...@linux.intel.com

Yeah, we could have incorporated the need_resched() check there too for
testing purposes.

Though that probably wouldn't make too much of a difference in
performance, since this patch also modifies the need_resched()
check in the loop as well as nohz.next_balance. So I think it would still be fine
to test this without the added need_resched().

Jason Low
Apr 8, 2015, 10:40:06 PM
On Wed, 2015-04-08 at 16:42 +0530, Srikar Dronamraju wrote:
> * Jason Low <jason...@hp.com> [2015-04-07 17:07:46]:
> > @@ -7687,7 +7700,7 @@ static inline bool nohz_kick_needed(struct rq *rq)
> > int nr_busy, cpu = rq->cpu;
> > bool kick = false;
> >
> > - if (unlikely(rq->idle_balance))
> > + if (unlikely(idle_cpu(cpu)))
> > return false;
>
>
> The only other place that we use idle_balance is
> run_rebalance_domains(). Would it make sense to just use idle_cpu() in
> run_rebalance_domains() and remove rq->idle_balance?

Hi Srikar,

So rq->idle_balance is used to store the idle state of the CPU before
the softirq is run, for load balancing decisions. In that case, we may
need to keep this extra variable in order to store that information.

Srikar Dronamraju
Apr 9, 2015, 3:10:06 AM
* Jason Low <jason...@hp.com> [2015-04-08 19:39:15]:

> On Wed, 2015-04-08 at 16:42 +0530, Srikar Dronamraju wrote:
> > * Jason Low <jason...@hp.com> [2015-04-07 17:07:46]:
> > > @@ -7687,7 +7700,7 @@ static inline bool nohz_kick_needed(struct rq *rq)
> > > int nr_busy, cpu = rq->cpu;
> > > bool kick = false;
> > >
> > > - if (unlikely(rq->idle_balance))
> > > + if (unlikely(idle_cpu(cpu)))
> > > return false;
> >
> >
> > The only other place that we use idle_balance is
> > run_rebalance_domains(). Would it make sense to just use idle_cpu() in
> > run_rebalance_domains() and remove rq->idle_balance?
>
> Hi Srikar,
>
> So the idle_balance is used for storing the idle state of the CPU before
> calling the softirq, for load balancing decisions. In that case, we may
> need to keep this extra variable in order to store that information.
>


I am not sure if you got what I wanted to convey.

rq->idle_balance gets updated at every scheduler_tick(), but the only user of
rq->idle_balance (after your change) seems to be run_rebalance_domains().
So can we remove rq->idle_balance? This would mean we would have to
call idle_cpu() instead of using rq->idle_balance in
run_rebalance_domains() (similar to your change above).

That way we can reduce the rq struct size, and we might end up calling
idle_cpu() fewer times.

--
Thanks and Regards
Srikar Dronamraju

Jason Low
Apr 9, 2015, 6:50:06 PM
On Thu, 2015-04-09 at 12:32 +0530, Srikar Dronamraju wrote:

> rq->idle_balance gets updated at every scheduler_tick() but the only user of
> rq->idle_balance (after your change) seems to be run_rebalance_domains().
> Now can we remove rq->idle_balance. This would mean we would have to
> call idle_cpu() instead of using rq->idle_balance in
> run_rebalance_domains(). (similar to what your above change)
>
> That way we can reduce the rq struct size and we might end up calling
> idle_cpu() lesser number of times.

Yeah, we may also include another patch for that.

Taking a look at rebalance_domains(), we're already updating the "idle"
value using idle_cpu() after attempting load balancing anyway, so there
may not be much point in the extra parameter.

Srikar Dronamraju
Apr 10, 2015, 4:40:06 AM
> > >
> > > #ifdef CONFIG_NO_HZ_COMMON
> > > +static inline bool nohz_kick_needed(struct rq *rq);
> > > +
> > > +static inline void pass_nohz_balance(struct rq *this_rq, int this_cpu)
> > > +{
> > > + clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
> > > + nohz.next_balance = jiffies;
> >
> > Why are we updating nohz.next_balance here?
>
> This was just to make sure that since we're continuing the balancing on
> another CPU that the nohz next_balance is guaranteed to be "now".
>

Since we are in nohz_idle_balance(), nohz.next_balance is guaranteed to be
less than now. We do check for time_before(now, nohz.next_balance) in
nohz_kick_needed(). So in effect we are incrementing nohz.next_balance.

While updating nohz.next_balance may not cause any issues, it at least
looks redundant to me.

At this point, I also wanted to understand why we do
"nohz.next_balance++" in nohz_balancer_kick()?


> > > + if (nohz_kick_needed(this_rq))
> > > + nohz_balancer_kick();
> > > +}
> > > +
> > > /*

--
Thanks and Regards
Srikar Dronamraju

Preeti U Murthy
Apr 13, 2015, 2:20:07 AM
Hi Jason,

On 04/08/2015 05:37 AM, Jason Low wrote:
> On Tue, 2015-04-07 at 16:28 -0700, Jason Low wrote:
>
>> Okay, so perhaps we can also try continuing nohz load balancing if we
>> find that there are overloaded CPUs in the system.

Sorry about the delay. Ok I will test out the below patch and share the
results.

Regards
Preeti U Murthy

Jason Low
Apr 13, 2015, 3:00:06 PM
On Fri, 2015-04-10 at 14:07 +0530, Srikar Dronamraju wrote:
> > > >
> > > > #ifdef CONFIG_NO_HZ_COMMON
> > > > +static inline bool nohz_kick_needed(struct rq *rq);
> > > > +
> > > > +static inline void pass_nohz_balance(struct rq *this_rq, int this_cpu)
> > > > +{
> > > > + clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
> > > > + nohz.next_balance = jiffies;
> > >
> > > Why are we updating nohz.next_balance here?
> >
> > This was just to make sure that since we're continuing the balancing on
> > another CPU that the nohz next_balance is guaranteed to be "now".
> >
>
> Since we are in nohz_idle_balance(), nohz.next_balance is guaranteed be
> less than now. We do check for time_before(now, nohz.next_balance) in
> nohz_kick_needed(). So in effect we are incrementing the nohz.next_balance.

Hi Srikar,

If now is equal to nohz.next_balance, we may attempt
nohz_balancer_kick().

After it does nohz.next_balance++ in nohz_balancer_kick(), now can be 1
less than the new nohz.next_balance value by the time
nohz_idle_balance() is attempted. Without updating nohz.next_balance,
the time_before(now, nohz.next_balance) check in nohz_kick_needed() may
cause it to return false.

Jason Low
Apr 13, 2015, 5:00:06 PM
On Fri, 2015-04-10 at 14:07 +0530, Srikar Dronamraju wrote:

> At this point, I also wanted to understand why we do
> "nohz.next_balance++" nohz_balancer_kick()?

So this looks like something that was added to avoid
nohz_balancer_kick() getting called too frequently. Otherwise, it may
get called in each trigger_load_balance(), even when another CPU has
already been kicked to do balancing.

Jason Low
Apr 13, 2015, 6:50:06 PM
hmm, so taking a look at the patch again, it looks like we pass nohz
balance even when the NOHZ_BALANCE_KICK is not set on the current CPU.
We should separate the 2 conditions:

if (!test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
return;

if (idle != CPU_IDLE) {
/* another CPU continue balancing */
pass_nohz_balance(this_rq, this_cpu);
return;
}

In general, separating the check also optimizes nohz_idle_balance() to
avoid clearing the bit when it is not set.

Jason Low
Apr 13, 2015, 11:10:06 PM
On Mon, 2015-04-13 at 15:49 -0700, Jason Low wrote:

> hmm, so taking a look at the patch again, it looks like we pass nohz
> balance even when the NOHZ_BALANCE_KICK is not set on the current CPU.
> We should separate the 2 conditions:
>
> if (!test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
> return;
>
> if (idle != CPU_IDLE) {
> /* another CPU continue balancing */
> pass_nohz_balance(this_rq, this_cpu);
> return;
> }

Here's the example patch with the above update.

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ffeaa41..9aa48f7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7622,6 +7622,16 @@ out:
}

#ifdef CONFIG_NO_HZ_COMMON
+static inline bool nohz_kick_needed(struct rq *rq);
+
+static inline void pass_nohz_balance(struct rq *this_rq, int this_cpu)
+{
+ clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
+ nohz.next_balance = jiffies;
+ if (nohz_kick_needed(this_rq))
+ nohz_balancer_kick();
+}
+
/*
* In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
* rebalancing for all the cpus for whom scheduler ticks are stopped.
@@ -7632,9 +7642,13 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
struct rq *rq;
int balance_cpu;

- if (idle != CPU_IDLE ||
- !test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
- goto end;
+ if (!test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
+ return;
+
+ if (idle != CPU_IDLE) {
+ pass_nohz_balance(this_rq, this_cpu);
+ return;
+ }

for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
@@ -7645,8 +7659,10 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
* work being done for other cpus. Next load
* balancing owner will pick it up.
*/
- if (need_resched())
- break;
+ if (need_resched()) {
+ pass_nohz_balance(this_rq, this_cpu);
+ return;
+ }

rq = cpu_rq(balance_cpu);

@@ -7666,7 +7682,6 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
this_rq->next_balance = rq->next_balance;
}
nohz.next_balance = this_rq->next_balance;
-end:
clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
}

@@ -7689,7 +7704,7 @@ static inline bool nohz_kick_needed(struct rq *rq)
int nr_busy, cpu = rq->cpu;
bool kick = false;

- if (unlikely(rq->idle_balance))
+ if (unlikely(idle_cpu(cpu)))
return false;

/*
@@ -7709,7 +7724,7 @@ static inline bool nohz_kick_needed(struct rq *rq)
if (time_before(now, nohz.next_balance))
return false;

- if (rq->nr_running >= 2)
+ if (rq->nr_running >= 2 || rq->rd->overload)
return true;

rcu_read_lock();
@@ -7759,16 +7774,14 @@ static void run_rebalance_domains(struct softirq_action *h)
enum cpu_idle_type idle = this_rq->idle_balance ?
CPU_IDLE : CPU_NOT_IDLE;

+ rebalance_domains(this_rq, idle);
+
/*
* If this cpu has a pending nohz_balance_kick, then do the
* balancing on behalf of the other idle cpus whose ticks are
- * stopped. Do nohz_idle_balance *before* rebalance_domains to
- * give the idle cpus a chance to load balance. Else we may
- * load balance only within the local sched_domain hierarchy
- * and abort nohz_idle_balance altogether if we pull some load.
+ * stopped.
*/
nohz_idle_balance(this_rq, idle);
- rebalance_domains(this_rq, idle);
}

/*



