
[REGRESSION 2.6.30][PATCH 1/1] sched: defer idle accounting till after load update period


Chase Douglas

Mar 29, 2010, 9:50:02 AM
There's a period of 10 ticks where calc_load_tasks is updated by all the
cpus for the load avg. Usually all the cpus do this during the first
tick. If any cpus go idle, calc_load_tasks is decremented accordingly.
However, if they wake up, calc_load_tasks is not incremented. Thus, if
cpus go idle during the 10 tick period, calc_load_tasks may be
decremented to a non-representative value. This issue can lead to
systems having a load avg of exactly 0, even though the real load avg
could theoretically be up to NR_CPUS.

This change defers the calc_load_tasks updates that each cpu makes
during the 10 tick period until after the period has passed.

BugLink: http://bugs.launchpad.net/bugs/513848

Signed-off-by: Chase Douglas <chase....@canonical.com>
---
kernel/sched.c | 16 ++++++++++++++--
1 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 9ab3cd7..c0aedac 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3064,7 +3064,8 @@ void calc_global_load(void)
  */
 static void calc_load_account_active(struct rq *this_rq)
 {
-	long nr_active, delta;
+	static atomic_long_t deferred;
+	long nr_active, delta, deferred_delta;
 
 	nr_active = this_rq->nr_running;
 	nr_active += (long) this_rq->nr_uninterruptible;
@@ -3072,6 +3073,17 @@ static void calc_load_account_active(struct rq *this_rq)
 	if (nr_active != this_rq->calc_load_active) {
 		delta = nr_active - this_rq->calc_load_active;
 		this_rq->calc_load_active = nr_active;
+
+		/* Need to defer idle accounting during load update period: */
+		if (unlikely(time_before(jiffies, this_rq->calc_load_update) &&
+			     time_after_eq(jiffies, calc_load_update))) {
+			atomic_long_add(delta, &deferred);
+			return;
+		}
+
+		deferred_delta = atomic_long_xchg(&deferred, 0);
+		delta += deferred_delta;
+
 		atomic_long_add(delta, &calc_load_tasks);
 	}
 }
@@ -3106,8 +3118,8 @@ static void update_cpu_load(struct rq *this_rq)
 	}
 
 	if (time_after_eq(jiffies, this_rq->calc_load_update)) {
-		this_rq->calc_load_update += LOAD_FREQ;
 		calc_load_account_active(this_rq);
+		this_rq->calc_load_update += LOAD_FREQ;
 	}
 }

--
1.6.3.3


Peter Zijlstra

Mar 29, 2010, 10:50:02 AM
On Mon, 2010-03-29 at 09:41 -0400, Chase Douglas wrote:
> There's a period of 10 ticks where calc_load_tasks is updated by all the
> cpus for the load avg. Usually all the cpus do this during the first
> tick. If any cpus go idle, calc_load_tasks is decremented accordingly.
> However, if they wake up, calc_load_tasks is not incremented. Thus, if
> cpus go idle during the 10 tick period, calc_load_tasks may be
> decremented to a non-representative value. This issue can lead to
> systems having a load avg of exactly 0, even though the real load avg
> could theoretically be up to NR_CPUS.
>
> This change defers the calc_load_tasks updates that each cpu makes
> during the 10 tick period until after the period has passed.

From reading the above changelog it seems to me there should be a
callback from leaving nohz mode; your proposed patch has no such thing.

Chase Douglas

Mar 29, 2010, 1:30:02 PM
On Mon, Mar 29, 2010 at 10:41 AM, Peter Zijlstra <pet...@infradead.org> wrote:
> On Mon, 2010-03-29 at 09:41 -0400, Chase Douglas wrote:
>> There's a period of 10 ticks where calc_load_tasks is updated by all the
>> cpus for the load avg. Usually all the cpus do this during the first
>> tick. If any cpus go idle, calc_load_tasks is decremented accordingly.
>> However, if they wake up, calc_load_tasks is not incremented. Thus, if
>> cpus go idle during the 10 tick period, calc_load_tasks may be
>> decremented to a non-representative value. This issue can lead to
>> systems having a load avg of exactly 0, even though the real load avg
>> could theoretically be up to NR_CPUS.
>>
>> This change defers the calc_load_tasks updates that each cpu makes
>> during the 10 tick period until after the period has passed.
>
> From reading the above changelog it seems to me there should be a
> callback from leaving nohz mode; your proposed patch has no such thing.

I believe what you're implying is that there should be a corresponding
call when a cpu wakes up, to counter the accounting done when it goes
to sleep during the 10 tick window. I don't think this is the
correct approach because the load avg should be a snapshot in time. If
there were 10 runnable tasks at the beginning of the load calculation
window, then we should account for 10 tasks. If they all get serviced
and the cpu goes to sleep, then wakes back up with one runnable task,
we would account for only one task instead of the original 10. What if
9 more tasks then get added to the cpu's rq during the 10 tick update
period? None of them would be accounted for either.

-- Chase

Andrew Morton

Apr 1, 2010, 3:30:01 PM
On Mon, 29 Mar 2010 09:41:12 -0400
Chase Douglas <chase....@canonical.com> wrote:

> There's a period of 10 ticks where calc_load_tasks is updated by all the
> cpus for the load avg. Usually all the cpus do this during the first
> tick. If any cpus go idle, calc_load_tasks is decremented accordingly.
> However, if they wake up, calc_load_tasks is not incremented. Thus, if
> cpus go idle during the 10 tick period, calc_load_tasks may be
> decremented to a non-representative value. This issue can lead to
> systems having a load avg of exactly 0, even though the real load avg
> could theoretically be up to NR_CPUS.
>
> This change defers the calc_load_tasks updates that each cpu makes
> during the 10 tick period until after the period has passed.
>
> BugLink: http://bugs.launchpad.net/bugs/513848
>

There was useful information in the [patch 0/1] email, such as the
offending commit ID. If possible, it's best to avoid the [patch 0/n]
thing altogether - that information either has to be moved into the
[patch 1/n] changelog by someone (ie: me), or it just gets omitted and
lost.

That seems a sensible way to avoid the gross-looking "10 ticks" thing.

What was the reason for "update the avenrun load estimates 10 ticks
after the CPUs have updated calc_load_tasks"? Can we do something
smarter there to fix this?

> +	deferred_delta = atomic_long_xchg(&deferred, 0);
> +	delta += deferred_delta;
> +
>  		atomic_long_add(delta, &calc_load_tasks);
>  	}
>  }

The global `deferred' is unfortunate from a design and possibly
scalability POV. Can it be moved into the `struct rq'? That way it
can become a plain old `unsigned long', too.
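Something like this, perhaps (untested sketch; the calc_load_deferred
field name is invented here):

	/* in struct rq, next to calc_load_active: */
	long calc_load_deferred;	/* deltas held back during the window */

	static void calc_load_account_active(struct rq *this_rq)
	{
		long nr_active, delta;

		nr_active = this_rq->nr_running;
		nr_active += (long) this_rq->nr_uninterruptible;

		if (nr_active != this_rq->calc_load_active) {
			delta = nr_active - this_rq->calc_load_active;
			this_rq->calc_load_active = nr_active;

			if (time_before(jiffies, this_rq->calc_load_update) &&
			    time_after_eq(jiffies, calc_load_update)) {
				/* rq->lock is held here, so plain
				 * arithmetic is enough */
				this_rq->calc_load_deferred += delta;
				return;
			}

			delta += this_rq->calc_load_deferred;
			this_rq->calc_load_deferred = 0;
			atomic_long_add(delta, &calc_load_tasks);
		}
	}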

Thomas Gleixner

Apr 1, 2010, 3:40:02 PM

The reason was to make sure that all cpus have updated their stuff
halfway before we start to calculate, and we spread the updates out
among the cpus to avoid contention on those global atomic variables.

Yes, we can do something smarter. First thing is to fix the
nr_uninterruptible accounting which can become negative. Peter looked
into that already and we really need to fix this first before doing
anything smart about that load avg stuff.

Thanks,

tglx

Chase Douglas

Apr 1, 2010, 4:00:02 PM
On Thu, Apr 1, 2010 at 3:27 PM, Andrew Morton <ak...@linux-foundation.org> wrote:
> On Mon, 29 Mar 2010 09:41:12 -0400
> Chase Douglas <chase....@canonical.com> wrote:
>
>> There's a period of 10 ticks where calc_load_tasks is updated by all the
>> cpus for the load avg. Usually all the cpus do this during the first
>> tick. If any cpus go idle, calc_load_tasks is decremented accordingly.
>> However, if they wake up, calc_load_tasks is not incremented. Thus, if
>> cpus go idle during the 10 tick period, calc_load_tasks may be
>> decremented to a non-representative value. This issue can lead to
>> systems having a load avg of exactly 0, even though the real load avg
>> could theoretically be up to NR_CPUS.
>>
>> This change defers the calc_load_tasks updates that each cpu makes
>> during the 10 tick period until after the period has passed.
>>
>> BugLink: http://bugs.launchpad.net/bugs/513848
>>
>
> There was useful information in the [patch 0/1] email, such as the
> offending commit ID.  If possible, it's best to avoid the [patch 0/n]
> thing altogether - that information either has to be moved into the
> [patch 1/n] changelog by someone (ie: me), or it just gets omitted and
> lost.

I'll be sure to fix that up in any future versions.

The reason it's global is that any given cpu may go NOHZ idle. If it
sleeps through the next load accounting, then its deferred value
won't get accounted for. By keeping it global and having it checked by
every cpu, we ensure that in the worst case where only one cpu is
awake to do the accounting, the deferred values from all cpus are
taken into account.

After a review by Andy Whitcroft on the Ubuntu kernel team, we decided
to do a few things to improve performance:

1. Check if values are non-zero before atomically touching the
deferred variable (load avg accounting is imprecise anyway, we just
need to make sure we don't lose any tasks)
2. Rename the deferred variable to calc_load_tasks_deferred and move
it right after calc_load_tasks to hopefully catch it in the same cache
line

Thomas Gleixner seems to think this isn't the best approach (later
email in this thread), so I'm deferring sending it unless someone is
still interested in this approach.

-- Chase

Chase Douglas

Apr 1, 2010, 4:10:01 PM

I noticed this too, and figured it was some clever accounting :). If
this is a real bug, I'd be happy to take a look at it as well.

What do you have in mind as a smarter way of fixing this issue, and
why is it dependent on fixing the absolute values of each runqueue's
nr_uninterruptible count?

-- Chase

Thomas Gleixner

Apr 1, 2010, 4:10:01 PM

Well, it's a good workaround for now I think, I just answered Andrew's
question whether we can't do any better :)

Thanks,

tglx

Thomas Gleixner

Apr 1, 2010, 4:20:02 PM

Well, the uninterruptible count imbalance is preventing us from
calculating the load avg per cpu, which is something the power saving
folks are interested in.

If that is fixed we probably have a good chance to collect the per cpu
stuff and build a combined one - have not looked into the gory details
of the math yet.

Also we need to think about a more clever way than just accounting the
number of running and uninterruptible tasks. What we really want is to
use the numbers which the scheduler has handy, i.e. how many tasks did
we run since we did the last loadavg calculation. It was always
possible (even before the big loadavg changes) to create a task which
consumes 50% of a CPU and never shows up in the loadavg calculations
at all.

Thanks,

tglx

Chase Douglas

Apr 1, 2010, 4:40:02 PM
On Thu, Apr 1, 2010 at 4:18 PM, Thomas Gleixner <tg...@linutronix.de> wrote:
> Also we need to think about a more clever way than just accounting the
> number of running and uninterruptible tasks. What we really want is to
> use the numbers which the scheduler has handy, i.e. how many tasks did
> we run since we did the last loadavg calculation. It was always
> possible (even before the big loadavg changes) to create a task which
> consumes 50% of a CPU and never shows up in the loadavg calculations
> at all.

I'm not sure I follow how knowing how many tasks have been run since
the last LOAD_FREQ expiration will help, or is that just an example of
the kind of data the scheduler has available that may be useful?

-- Chase

Thomas Gleixner

Apr 1, 2010, 4:40:02 PM
On Thu, 1 Apr 2010, Chase Douglas wrote:

> On Thu, Apr 1, 2010 at 4:18 PM, Thomas Gleixner <tg...@linutronix.de> wrote:
> > Also we need to think about a more clever way than just accounting the
> > number of running and uninterruptible tasks. What we really want is to
> > use the numbers which the scheduler has handy, i.e. how many tasks did
> > we run since we did the last loadavg calculation. It was always
> > possible (even before the big loadavg changes) to create a task which
> > consumes 50% of a CPU and never shows up in the loadavg calculations
> > at all.
>
> I'm not sure I follow how knowing how many tasks have been run since
> the last LOAD_FREQ expiration will help, or is that just an example of
> the kind of data the scheduler has available that may be useful?

Yes, it's just an example. Look at the following (UP) scenario:

-> tick calcs loadavg

-> task is woken and runs for 50% of a tick
-> task goes to sleep

-> tick calcs loadavg

So loadavg will say 0 forever, while in reality we know that there was
significant work on that CPU - at least when we have a sched_clock
which has a better granularity than the tick itself.
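E.g. a hand-wavy userspace sketch of such a task (assumes HZ=1000, and
in practice you'd also have to stay phase-aligned with the tick, so
take the numbers as illustration only):

	#include <time.h>

	static long elapsed_ns(struct timespec *a, struct timespec *b)
	{
		return (b->tv_sec - a->tv_sec) * 1000000000L
			+ (b->tv_nsec - a->tv_nsec);
	}

	int main(void)
	{
		struct timespec sleep_half_tick = { 0, 500000 };

		for (;;) {
			struct timespec start, now;

			/* burn the cpu for ~half a tick ... */
			clock_gettime(CLOCK_MONOTONIC, &start);
			do {
				clock_gettime(CLOCK_MONOTONIC, &now);
			} while (elapsed_ns(&start, &now) < 500000);

			/* ... and be asleep whenever the tick samples
			 * nr_running */
			nanosleep(&sleep_half_tick, NULL);
		}
	}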

Thanks,

tglx

Peter Zijlstra

Apr 2, 2010, 4:10:02 AM
On Thu, 2010-04-01 at 22:18 +0200, Thomas Gleixner wrote:
> Well, the uninterruptible count imbalance is preventing us from
> calculating the load avg per cpu, which is something the power saving
> folks are interested in.
>
> If that is fixed we probably have a good chance to collect the per cpu
> stuff and build a combined one - have not looked into the gory details
> of the math yet.

The problem with all this per-cpu loadavg is that it would require a
steady tick on each cpu to age this loadavg, which is quite at odds
with power savings, as those folks rather like the nohz feature.

No idea yet on what to do about that.

Chase Douglas

Apr 5, 2010, 10:50:02 AM
On Fri, Apr 2, 2010 at 3:59 AM, Peter Zijlstra <pet...@infradead.org> wrote:
> On Thu, 2010-04-01 at 22:18 +0200, Thomas Gleixner wrote:
>> Well, the uninterruptible count imbalance is preventing us from
>> calculating the load avg per cpu, which is something the power saving
>> folks are interested in.
>>
>> If that is fixed we probably have a good chance to collect the per cpu
>> stuff and build a combined one - have not looked into the gory details
>> of the math yet.
>
> The problem with all this per-cpu loadavg is that it would require a
> steady tick on each cpu to age this loadavg, which is quite at odds
> with power savings, as those folks rather like the nohz feature.
>
> No idea yet on what to do about that.

What if we stored the time at which a cpu goes idle in its rq? Then,
when the time comes around to calculate the load avg, one cpu should be
able to scan the run queues of all the cpus and use each cpu's load avg
if it is not idle, or its load avg plus idle timestamp math if it is idle.
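Very roughly, something like this (all the names below are made up for
illustration):

	static long calc_load_fold_all(void)
	{
		long total = 0;
		int cpu;

		for_each_online_cpu(cpu) {
			struct rq *rq = cpu_rq(cpu);

			if (rq->idle_stamp)
				/* idle: derive the contribution from the
				 * load it had when it went idle plus how
				 * long it has been idle since */
				total += idle_adjusted_load(rq,
						jiffies - rq->idle_stamp);
			else
				total += rq->calc_load_active;
		}
		return total;
	}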

-- Chase

Andrew Morton

Apr 13, 2010, 4:50:01 PM
On Thu, 1 Apr 2010 22:01:59 +0200 (CEST)
Thomas Gleixner <tg...@linutronix.de> wrote:

> On Thu, 1 Apr 2010, Chase Douglas wrote:
> > On Thu, Apr 1, 2010 at 3:27 PM, Andrew Morton <ak...@linux-foundation.org> wrote:
> >
> > Thomas Gleixner seems to think this isn't the best approach (later
> > email in this thread), so I'm deferring sending it unless someone is
> > still interested in this approach.
>
> Well, it's a good workaround for now I think, I just answered Andrew's
> question whether we can't do any better :)
>

So... should we merge Chase's patch?

Thomas Gleixner

Apr 13, 2010, 5:10:01 PM
On Tue, 13 Apr 2010, Andrew Morton wrote:

> On Thu, 1 Apr 2010 22:01:59 +0200 (CEST)
> Thomas Gleixner <tg...@linutronix.de> wrote:
>
> > On Thu, 1 Apr 2010, Chase Douglas wrote:
> > > On Thu, Apr 1, 2010 at 3:27 PM, Andrew Morton <ak...@linux-foundation.org> wrote:
> > >
> > > Thomas Gleixner seems to think this isn't the best approach (later
> > > email in this thread), so I'm deferring sending it unless someone is
> > > still interested in this approach.
> >
> > Well, it's a good workaround for now I think, I just answered Andrew's
> > question whether we can't do any better :)
> >
>
> So... should we merge Chase's patch?

I have no better solution right now. Peter?

Thanks,

tglx

Chase Douglas

Apr 13, 2010, 5:10:02 PM
On Tue, Apr 13, 2010 at 2:02 PM, Thomas Gleixner <tg...@linutronix.de> wrote:
> On Tue, 13 Apr 2010, Andrew Morton wrote:
>
>> On Thu, 1 Apr 2010 22:01:59 +0200 (CEST)
>> Thomas Gleixner <tg...@linutronix.de> wrote:
>>
>> > On Thu, 1 Apr 2010, Chase Douglas wrote:
>> > > On Thu, Apr 1, 2010 at 3:27 PM, Andrew Morton <ak...@linux-foundation.org> wrote:
>> > >
>> > > Thomas Gleixner seems to think this isn't the best approach (later
>> > > email in this thread), so I'm deferring sending it unless someone is
>> > > still interested in this approach.
>> >
>> > Well, it's a good workaround for now I think, I just answered Andrew's
>> > question whether we can't do any better :)
>> >
>>
>> So... should we merge Chase's patch?
>
> I have no better solution right now. Peter?

I've made one small change to it. Checking the atomic .counter
variable probably isn't the most portable, and atomic_long_read()
should be portable and not add any overhead on sane platforms, so I
replaced the direct check against .counter with an atomic_long_read().
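The change is basically:

-	if (calc_load_tasks_deferred.counter)
+	if (atomic_long_read(&calc_load_tasks_deferred))
 		delta += atomic_long_xchg(&calc_load_tasks_deferred, 0);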
I'll resend it shortly.

-- Chase
