High CPU load when machine is idle

Damien Wyart

unread,

Sep 29, 2010, 3:10:01 AM9/29/10

to

Hello,

With 2.6.36-rc6, I'm seeing a load of around 0.60 when the machine is
completely idle. This is similar to what someone reported for the latest
2.6.35.x stables. This is on a core i7 machine, but I've no time to
bisect or test earlier versions right now, but I guess this is easy to
reproduce on the same plateform.

Best,
--
Damien Wyart
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Damien Wyart

unread,

Oct 14, 2010, 11:00:01 AM10/14/10

to

Hello,

> With 2.6.36-rc6, I'm seeing a load of around 0.60 when the machine is
> completely idle. This is similar to what someone reported for the latest
> 2.6.35.x stables. This is on a core i7 machine, but I've no time to
> bisect or test earlier versions right now, but I guess this is easy to
> reproduce on the same plateform.

After further investigation and cross-checking with the thread "PROBLEM:
Unusually high load average when idle in 2.6.35, 2.6.35.1 and later",
I came to the following results:

- the commit 74f5187ac873042f502227701ed1727e7c5fbfa9 isolated by Tim
seems to be the culprit;
- reverting it solves the problem with 2.6.36-rc7 in NOHZ mode: the load
when idle goes down to 0.00 (which it never does with the patch
applied)
- using nohz=no with the commit reverted still gives correct behaviour
(tested just in case)

- vanilla 2.6.36-rc7 with this commit applied has the problem (load is
around 0.60 when machine idle, sometimes less after several hours
of uptime, but never 0.00), and rebooting with nohz=no makes the
problem disappear: load goes down to 0.00 quickly after boot process
has finished.

I hope this answers the questions raised in the Tim's thread.

Could someone with knowledge of the commit take a look at the problem?
It would be a bit annoying to have this problem in 2.6.36, since Tim's
initial report dates back to 2 weeks ago...

I can help for further testing if needed.

Thanks in advance,
--
Damien

Chase Douglas

unread,

Oct 14, 2010, 11:30:02 AM10/14/10

to

Sorry I haven't been as responsive to this issue as I would have liked.
I've been rather busy on other work.

My biggest testing concern is that in reality, load on a normal desktop
machine (i.e. not some stripped down machine disconnected from network
or any other input running nothing but busybox) should not be 0.00.
Maybe load on a server doing absolutely nothing could be 0.00, but
there's usually something going on that should bump it up to a few
hundredths. Watch top, and if you see at least 1% constant cpu usage
there, your load average should be at least 0.01. That said, there does
seem to be a bug somewhere as a load of 0.60 on an idle machine seems
high.

-- Chase

Damien Wyart

unread,

Oct 14, 2010, 12:00:01 PM10/14/10

to

* Chase Douglas <chase....@canonical.com> [101014 17:29]:

> My biggest testing concern is that in reality, load on a normal
> desktop machine (i.e. not some stripped down machine disconnected from
> network or any other input running nothing but busybox) should not be
> 0.00. Maybe load on a server doing absolutely nothing could be 0.00,
> but there's usually something going on that should bump it up to a few
> hundredths. Watch top, and if you see at least 1% constant cpu usage
> there, your load average should be at least 0.01. That said, there
> does seem to be a bug somewhere as a load of 0.60 on an idle machine
> seems high.

FWIW, I'm using htop which maybe behaves differently than top, but after
a few seconds or tens of seconds, values are really stabilizing at 0.00
(displaying 0.01 from time to time, but quite rarely). I also get 0.00
through calling "uptime".

These tests have been done on a big desktop machine accessed remotely
through ssh, so the usual graphical environment eyecandy and Web browser
are not running at all, and I have been careful not to run heavy
processes in parallel, so reaching 0.00 doesn't seem abnormal in that
case.

As you wrote, the "bad" case with the commit applied and nohz=yes really
seems wrong because in the same conditions, idle load is several tens
orders of magnitude larger.

--
Damien

Peter Zijlstra

unread,

Oct 15, 2010, 7:10:02 AM10/15/10

to

On Thu, 2010-10-14 at 16:58 +0200, Damien Wyart wrote:

> - the commit 74f5187ac873042f502227701ed1727e7c5fbfa9 isolated by Tim
> seems to be the culprit;

Right, so I think I figured out what's happening.

We're folding sucessive idles of the same cpu into the total idle
number, which is inflating things.

+/*
+ * For NO_HZ we delay the active fold to the next LOAD_FREQ update.
+ *
+ * When making the ILB scale, we should try to pull this in as well.
+ */
+static atomic_long_t calc_load_tasks_idle;
+
+static void calc_load_account_idle(struct rq *this_rq)
+{
+ long delta;
+
+ delta = calc_load_fold_active(this_rq);
+ if (delta)
+ atomic_long_add(delta, &calc_load_tasks_idle);
+}
+
+static long calc_load_fold_idle(void)
+{
+ long delta = 0;
+
+ /*
+ * Its got a race, we don't care...
+ */
+ if (atomic_long_read(&calc_load_tasks_idle))
+ delta = atomic_long_xchg(&calc_load_tasks_idle, 0);
+
+ return delta;
+}

If you look at that and imagine CPU1 going idle with 1 task blocked,
then waking up due to unblocking, then going idle with that same task
block, etc.. all before we fold_idle on an active cpu, then we can count
that one task many times over.

I haven't come up with a sane patch yet, hackery below, but that does
let my 24-cpu system idle into load 0.0x instead of the constant 1.x it
had before.

Beware, utter hackery below.. lots of races not sure it matters but it
ain't pretty..

Anybody got a bright idea here?

---
kernel/sched.c | 32 ++++++++++++++++++++++++++++++--
kernel/sched_idletask.c | 1 +
2 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 91c19db..ac4512d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -521,6 +521,9 @@ struct rq {
/* calc_load related fields */
unsigned long calc_load_update;
long calc_load_active;
+#ifdef CONFIG_NO_HZ
+ long calc_load_inactive;
+#endif

#ifdef CONFIG_SCHED_HRTICK
#ifdef CONFIG_SMP
@@ -1817,6 +1820,7 @@ static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)
#endif

static void calc_load_account_idle(struct rq *this_rq);
+static void calc_load_account_nonidle(struct rq *this_rq);
static void update_sysctl(void);
static int get_update_sysctl_factor(void);
static void update_cpu_load(struct rq *this_rq);
@@ -2978,14 +2982,33 @@ static long calc_load_fold_active(struct rq *this_rq)
* When making the ILB scale, we should try to pull this in as well.
*/
static atomic_long_t calc_load_tasks_idle;
+static cpumask_var_t calc_load_mask;

static void calc_load_account_idle(struct rq *this_rq)
{
long delta;

delta = calc_load_fold_active(this_rq);
- if (delta)
+ this_rq->calc_load_inactive = delta;
+
+ if (delta) {
atomic_long_add(delta, &calc_load_tasks_idle);
+ cpumask_set_cpu(cpu_of(this_rq), calc_load_mask);
+ }
+
+ trace_printk("idle start: %d %Ld %Ld\n", cpu_of(this_rq), delta,
+ atomic_long_read(&calc_load_tasks_idle));
+}
+
+static void calc_load_account_nonidle(struct rq *this_rq)
+{
+ if (cpumask_test_and_clear_cpu(cpu_of(this_rq), calc_load_mask)) {
+ atomic_long_sub(this_rq->calc_load_inactive, &calc_load_tasks_idle);
+ trace_printk("idle end: %d %Ld %Ld\n", cpu_of(this_rq),
+ this_rq->calc_load_inactive,
+ atomic_long_read(&calc_load_tasks_idle));
+ } else
+ trace_printk("idle end: %d\n", cpu_of(this_rq));
}

static long calc_load_fold_idle(void)
@@ -2995,8 +3018,12 @@ static long calc_load_fold_idle(void)
/*
* Its got a race, we don't care...
*/
- if (atomic_long_read(&calc_load_tasks_idle))
+ if (atomic_long_read(&calc_load_tasks_idle)) {
delta = atomic_long_xchg(&calc_load_tasks_idle, 0);
+ cpumask_clear(calc_load_mask);
+ }
+
+ trace_printk("idle fold: %d %Ld\n", smp_processor_id(), delta);

return delta;
}
@@ -7935,6 +7962,7 @@ void __init sched_init(void)
atomic_set(&nohz.load_balancer, nr_cpu_ids);
atomic_set(&nohz.first_pick_cpu, nr_cpu_ids);
atomic_set(&nohz.second_pick_cpu, nr_cpu_ids);
+ zalloc_cpumask_var(&calc_load_mask, GFP_NOWAIT);
#endif
/* May be allocated at isolcpus cmdline parse time */
if (cpu_isolated_map == NULL)
diff --git a/kernel/sched_idletask.c b/kernel/sched_idletask.c
index 9fa0f402..a7fa1aa 100644
--- a/kernel/sched_idletask.c
+++ b/kernel/sched_idletask.c
@@ -42,6 +42,7 @@ dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags)

static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
{
+ calc_load_account_nonidle(rq);
}

static void task_tick_idle(struct rq *rq, struct task_struct *curr, int queued)

Peter Zijlstra

unread,

Oct 18, 2010, 8:40:02 AM10/18/10

to

OK, I came up with the below, but its not quite working, load continues
to decrease even though I've got a make -j64 running..

Thomas, Chase, any clue?

---
kernel/sched.c | 31 +++++++++++++++++++++++++------
kernel/sched_idletask.c | 1 +
2 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 3312c64..a56446b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -521,6 +521,10 @@ struct rq {

/* calc_load related fields */
unsigned long calc_load_update;
long calc_load_active;
+#ifdef CONFIG_NO_HZ

+ long calc_load_idle;
+ int calc_load_seq;

+#endif

#ifdef CONFIG_SCHED_HRTICK
#ifdef CONFIG_SMP

@@ -1817,6 +1821,7 @@ static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)

#endif

static void calc_load_account_idle(struct rq *this_rq);
+static void calc_load_account_nonidle(struct rq *this_rq);
static void update_sysctl(void);
static int get_update_sysctl_factor(void);
static void update_cpu_load(struct rq *this_rq);

@@ -2978,14 +2983,25 @@ static long calc_load_fold_active(struct rq *this_rq)

* When making the ILB scale, we should try to pull this in as well.
*/
static atomic_long_t calc_load_tasks_idle;

+static atomic_t calc_load_seq;

static void calc_load_account_idle(struct rq *this_rq)

{
- long delta;
+ long idle;

- delta = calc_load_fold_active(this_rq);
- if (delta)
- atomic_long_add(delta, &calc_load_tasks_idle);
+ idle = calc_load_fold_active(this_rq);
+ this_rq->calc_load_idle = idle;
+
+ if (idle) {
+ this_rq->calc_load_seq = atomic_read(&calc_load_seq);
+ atomic_long_add(idle, &calc_load_tasks_idle);
+ }

+}
+
+static void calc_load_account_nonidle(struct rq *this_rq)
+{

+ if (atomic_read(&calc_load_seq) == this_rq->calc_load_seq)
+ atomic_long_sub(this_rq->calc_load_idle, &calc_load_tasks_idle);
}

static long calc_load_fold_idle(void)
@@ -2993,10 +3009,13 @@ static long calc_load_fold_idle(void)
long delta = 0;

/*
- * Its got a race, we don't care...
+ * Its got races, we don't care... its only statistics after all.

*/
- if (atomic_long_read(&calc_load_tasks_idle))
+ if (atomic_long_read(&calc_load_tasks_idle)) {
delta = atomic_long_xchg(&calc_load_tasks_idle, 0);

+ if (delta)
+ atomic_inc(&calc_load_seq);
+ }

return delta;

Damien Wyart

unread,

Oct 20, 2010, 9:30:03 AM10/20/10

to

> > With 2.6.36-rc6, I'm seeing a load of around 0.60 when the machine
> > is completely idle. This is similar to what someone reported for the
> > latest 2.6.35.x stables. This is on a core i7 machine, but I've no
> > time to bisect or test earlier versions right now, but I guess this
> > is easy to reproduce on the same plateform.

> After further investigation and cross-checking with the thread "PROBLEM:
> Unusually high load average when idle in 2.6.35, 2.6.35.1 and later",
> I came to the following results:

> - the commit 74f5187ac873042f502227701ed1727e7c5fbfa9 isolated by Tim
> seems to be the culprit;
> - reverting it solves the problem with 2.6.36-rc7 in NOHZ mode: the load
> when idle goes down to 0.00 (which it never does with the patch
> applied)

In fact, after several hours of uptime, I also came into a situation of
the load being around 0.80 or 0.60 when idle with the commit reverted.
So IMHO, just reverting is not a better option than keeping the
offending commit, and a real rework of the code is needed to clean up
the situation.

Should'nt we enlarge the list of CC, because for now, responsivity has
been close to 0 and it seems we will get a 2.6.36 with buggy load avg
calculation. Even if it is only statistics, many supervision tools rely
on the load avg, so for production environments, this is not a good
thing.

Best,

Peter Zijlstra

unread,

Oct 20, 2010, 9:40:02 AM10/20/10

to

On Wed, 2010-10-20 at 15:27 +0200, Damien Wyart wrote:
>
> Should'nt we enlarge the list of CC, because for now, responsivity has
> been close to 0 and it seems we will get a 2.6.36 with buggy load avg
> calculation. Even if it is only statistics, many supervision tools rely
> on the load avg, so for production environments, this is not a good
> thing.

It already contains all the folks who know the code I'm afraid.. :/

I've been playing with it a bit more today, but haven't actually managed
to make it better, just differently worse..

Peter Zijlstra

unread,

Oct 20, 2010, 9:50:01 AM10/20/10

to

On Wed, 2010-10-20 at 15:30 +0200, Peter Zijlstra wrote:
> On Wed, 2010-10-20 at 15:27 +0200, Damien Wyart wrote:
> >
> > Should'nt we enlarge the list of CC, because for now, responsivity has
> > been close to 0 and it seems we will get a 2.6.36 with buggy load avg
> > calculation. Even if it is only statistics, many supervision tools rely
> > on the load avg, so for production environments, this is not a good
> > thing.
>
> It already contains all the folks who know the code I'm afraid.. :/
>
> I've been playing with it a bit more today, but haven't actually managed
> to make it better, just differently worse..

Ah, I just remembered Venki recently poked at this code too, maybe he's
got a bright idea..

Venki, there are cpu-load issues, the reported issue is that idle load
is too high, and I think I can see that happening with the current code
(due to 74f5187ac8).

The flaw I can see in that commit is that we can go idle multiple times
during the LOAD_FREQ window, which will basically inflate the idle
contribution.

All attempts from me to fix that so far have resulted in curious
results..

Would you have a moment to also look at this?

Peter Zijlstra

unread,

Oct 20, 2010, 10:20:02 AM10/20/10

to

On Wed, 2010-10-20 at 15:43 +0200, Peter Zijlstra wrote:
> On Wed, 2010-10-20 at 15:30 +0200, Peter Zijlstra wrote:
> > On Wed, 2010-10-20 at 15:27 +0200, Damien Wyart wrote:
> > >
> > > Should'nt we enlarge the list of CC, because for now, responsivity has
> > > been close to 0 and it seems we will get a 2.6.36 with buggy load avg
> > > calculation. Even if it is only statistics, many supervision tools rely
> > > on the load avg, so for production environments, this is not a good
> > > thing.
> >
> > It already contains all the folks who know the code I'm afraid.. :/
> >
> > I've been playing with it a bit more today, but haven't actually managed
> > to make it better, just differently worse..
>
> Ah, I just remembered Venki recently poked at this code too, maybe he's
> got a bright idea..
>
> Venki, there are cpu-load issues, the reported issue is that idle load
> is too high, and I think I can see that happening with the current code
> (due to 74f5187ac8).
>
> The flaw I can see in that commit is that we can go idle multiple times
> during the LOAD_FREQ window, which will basically inflate the idle
> contribution.
>
> All attempts from me to fix that so far have resulted in curious
> results..
>
> Would you have a moment to also look at this?

Just for reference this is my latest patch.. I figured that since its
NOHZ related it should actually be keyed of off the nohz code, not going
idle.

---
include/linux/sched.h | 8 ++++++++
kernel/sched.c | 28 +++++++++++++++++++++-------
kernel/sched_idletask.c | 1 -
kernel/time/tick-sched.c | 2 ++
4 files changed, 31 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0383601..5311ef4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -145,6 +145,14 @@ extern unsigned long this_cpu_load(void);

extern void calc_global_load(void);

+#ifdef CONFIG_NO_HZ
+extern void calc_load_account_idle(void);
+extern void calc_load_account_nonidle(void);
+#else
+static inline void calc_load_account_idle(void) { }
+static inline void calc_load_account_nonidle(void) { }
+#endif
+
extern unsigned long get_parent_ip(unsigned long addr);

struct seq_file;
diff --git a/kernel/sched.c b/kernel/sched.c
index abf8440..79a29e6 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -526,6 +526,10 @@ struct rq {

/* calc_load related fields */
unsigned long calc_load_update;
long calc_load_active;
+#ifdef CONFIG_NO_HZ

+ long calc_load_inactive;

+ int calc_load_seq;
+#endif

#ifdef CONFIG_SCHED_HRTICK
#ifdef CONFIG_SMP

@@ -1833,7 +1837,6 @@ static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)
}
#endif

-static void calc_load_account_idle(struct rq *this_rq);

static void update_sysctl(void);
static int get_update_sysctl_factor(void);
static void update_cpu_load(struct rq *this_rq);

@@ -3111,16 +3114,29 @@ static long calc_load_fold_active(struct rq *this_rq)

* When making the ILB scale, we should try to pull this in as well.
*/
static atomic_long_t calc_load_tasks_idle;
+static atomic_t calc_load_seq;

-static void calc_load_account_idle(struct rq *this_rq)
+void calc_load_account_idle(void)
{
+ struct rq *this_rq = this_rq();

long delta;

delta = calc_load_fold_active(this_rq);

+ this_rq->calc_load_inactive = delta;

+ this_rq->calc_load_seq = atomic_read(&calc_load_seq);
+

if (delta)
atomic_long_add(delta, &calc_load_tasks_idle);
}

+void calc_load_account_nonidle(void)
+{
+ struct rq *this_rq = this_rq();

+
+ if (atomic_read(&calc_load_seq) == this_rq->calc_load_seq)

+ atomic_long_add(this_rq->calc_load_inactive, &calc_load_tasks_idle);
+}
+

static long calc_load_fold_idle(void)
{
long delta = 0;

@@ -3128,16 +3144,14 @@ static long calc_load_fold_idle(void)
/*

* Its got a race, we don't care...

*/
- if (atomic_long_read(&calc_load_tasks_idle))
+ if (atomic_long_read(&calc_load_tasks_idle)) {

+ atomic_inc(&calc_load_seq);

delta = atomic_long_xchg(&calc_load_tasks_idle, 0);
+ }

return delta;
}
#else
-static void calc_load_account_idle(struct rq *this_rq)
-{
-}
-
static inline long calc_load_fold_idle(void)
{
return 0;
diff --git a/kernel/sched_idletask.c b/kernel/sched_idletask.c
index 9fa0f402..6ca191f 100644
--- a/kernel/sched_idletask.c
+++ b/kernel/sched_idletask.c
@@ -23,7 +23,6 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
static struct task_struct *pick_next_task_idle(struct rq *rq)
{
schedstat_inc(rq, sched_goidle);
- calc_load_account_idle(rq);
return rq->idle;
}

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 3e216e0..808abd7 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -411,6 +411,7 @@ void tick_nohz_stop_sched_tick(int inidle)
ts->tick_stopped = 1;
ts->idle_jiffies = last_jiffies;
rcu_enter_nohz();
+ calc_load_account_idle();
}

ts->idle_sleeps++;
@@ -520,6 +521,7 @@ void tick_nohz_restart_sched_tick(void)

ts->inidle = 0;

+ calc_load_account_nonidle();
rcu_exit_nohz();

/* Update jiffies first */

Peter Zijlstra

unread,

Oct 20, 2010, 10:30:01 AM10/20/10

to

So that should read: that atomic_long_sub()

Trouble is, load goes down with that patch fixed, it just never goes
up :/

Peter Zijlstra

unread,

Oct 20, 2010, 1:30:01 PM10/20/10

to

OK, how does this work for people? I find my idle load is still a tad
high, but maybe I'm not patient enough.

---
include/linux/sched.h | 8 ++++++++
kernel/sched.c | 32 ++++++++++++++++++++++++++------

kernel/sched_idletask.c | 1 -
kernel/time/tick-sched.c | 2 ++

4 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0383601..5311ef4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -145,6 +145,14 @@ extern unsigned long this_cpu_load(void);

extern void calc_global_load(void);

+#ifdef CONFIG_NO_HZ
+extern void calc_load_account_idle(void);
+extern void calc_load_account_nonidle(void);
+#else
+static inline void calc_load_account_idle(void) { }
+static inline void calc_load_account_nonidle(void) { }
+#endif
+
extern unsigned long get_parent_ip(unsigned long addr);

struct seq_file;
diff --git a/kernel/sched.c b/kernel/sched.c

index abf8440..f214222 100644

--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -526,6 +526,10 @@ struct rq {
/* calc_load related fields */
unsigned long calc_load_update;
long calc_load_active;
+#ifdef CONFIG_NO_HZ
+ long calc_load_inactive;
+ int calc_load_seq;
+#endif

#ifdef CONFIG_SCHED_HRTICK
#ifdef CONFIG_SMP
@@ -1833,7 +1837,6 @@ static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)
}
#endif

-static void calc_load_account_idle(struct rq *this_rq);
static void update_sysctl(void);
static int get_update_sysctl_factor(void);
static void update_cpu_load(struct rq *this_rq);

@@ -3111,16 +3114,36 @@ static long calc_load_fold_active(struct rq *this_rq)

* When making the ILB scale, we should try to pull this in as well.
*/
static atomic_long_t calc_load_tasks_idle;
+static atomic_t calc_load_seq;

-static void calc_load_account_idle(struct rq *this_rq)
+void calc_load_account_idle(void)
{
+ struct rq *this_rq = this_rq();
long delta;

delta = calc_load_fold_active(this_rq);
+ this_rq->calc_load_inactive = delta;
+ this_rq->calc_load_seq = atomic_read(&calc_load_seq);
+
if (delta)
atomic_long_add(delta, &calc_load_tasks_idle);
}

+void calc_load_account_nonidle(void)
+{
+ struct rq *this_rq = this_rq();
+
+ if (atomic_read(&calc_load_seq) == this_rq->calc_load_seq) {

+ atomic_long_sub(this_rq->calc_load_inactive, &calc_load_tasks_idle);
+ /*
+ * Undo the _fold_active() from _account_idle(). This
+ * avoids us loosing active tasks and creating a negative
+ * bias
+ */
+ this_rq->calc_load_active -= this_rq->calc_load_inactive;
+ }

+}
+
static long calc_load_fold_idle(void)
{
long delta = 0;

@@ -3128,16 +3151,13 @@ static long calc_load_fold_idle(void)

/*
* Its got a race, we don't care...
*/

+ atomic_inc(&calc_load_seq);

if (atomic_long_read(&calc_load_tasks_idle))
delta = atomic_long_xchg(&calc_load_tasks_idle, 0);

Damien Wyart

unread,

Oct 20, 2010, 4:30:02 PM10/20/10

to

* Peter Zijlstra <pet...@infradead.org> [101020 19:26]:

> OK, how does this work for people? I find my idle load is still a tad
> high, but maybe I'm not patient enough.

Looks quite fine after some basic tests, much saner than without the
patch. A bit slow to go down, but I reach 0.00 after enough time being
idle.

Can't tell about the behavior after hours of uptime, of course, but the
values during the first minutes after bootup seems OK; without the
patch, they are evidently wrong...

Maybe commit this after some more reviews in the next 1 or 2 days, and
maybe think about further tweaking during 2.6.37?

--
Damien

tmhi...@gmail.com

unread,

Oct 20, 2010, 9:50:02 PM10/20/10

to

On Wed, Oct 20, 2010 at 07:26:45PM +0200, Peter Zijlstra wrote:
>
>
> OK, how does this work for people? I find my idle load is still a tad
> high, but maybe I'm not patient enough.

I haven't had a chance to keep up with the topic, and I apologize. I'll be
testing this as soon as I can finish compiling it. Thank you all for not
letting this go unfixed.

Tim McGrath

tmhi...@gmail.com

unread,

Oct 20, 2010, 10:00:01 PM10/20/10

to

Uhh, problem. This patch does not apply to git checkout
74f5187ac873042f502227701ed1727e7c5fbfa9

which is the version of the kernel that first exhibits this flaw.

which version of the kernel does this patch apply cleanly to?

Tim McGrath

Ingo Molnar

unread,

Oct 21, 2010, 4:30:01 AM10/21/10

to

* tmhi...@gmail.com <tmhi...@gmail.com> wrote:

Try -tip (which includes the scheduler development tree as well):

http://people.redhat.com/mingo/tip.git/README

Thanks,

Ingo

tmhi...@gmail.com

unread,

Oct 21, 2010, 5:00:02 AM10/21/10

to

On Thu, Oct 21, 2010 at 10:22:33AM +0200, Ingo Molnar wrote:
>
> * tmhi...@gmail.com <tmhi...@gmail.com> wrote:
>
> > On Wed, Oct 20, 2010 at 09:48:43PM -0400, tm@ wrote:
> > > On Wed, Oct 20, 2010 at 07:26:45PM +0200, Peter Zijlstra wrote:
> > > >
> > > >
> > > > OK, how does this work for people? I find my idle load is still a tad
> > > > high, but maybe I'm not patient enough.
> > >
> > > I haven't had a chance to keep up with the topic, and I apologize. I'll be
> > > testing this as soon as I can finish compiling it. Thank you all for not
> > > letting this go unfixed.
> > >
> > > Tim McGrath
> >
> > Uhh, problem. This patch does not apply to git checkout
> > 74f5187ac873042f502227701ed1727e7c5fbfa9
> >
> > which is the version of the kernel that first exhibits this flaw.
> >
> > which version of the kernel does this patch apply cleanly to?
>
> Try -tip (which includes the scheduler development tree as well):
>
> http://people.redhat.com/mingo/tip.git/README
>
> Thanks,
>
> Ingo

Tried that, patch still doesn't apply... and I just figured out why. Looks
like my email client is screwing the patch up. mutt apparently wants to chew
on my mail before I get it. viewing the mail as an attachment and saving it
works properly however.

Now that I have properly saved the mail, it applies cleanly to tip/master as
well as 74f5187ac873042f502227701ed1727e7c5fbfa9 - though in the latter's
case it's having to fuzz around a bit. I'll try testing
74f5187ac873042f502227701ed1727e7c5fbfa9 first since it's the one I *know*
is flawed, and I want to reduce the amount of changes that I have to test
for.

I'll build and test it, then let you guys know if there's any noticable
difference.

Tim McGrath

Peter Zijlstra

unread,

Oct 21, 2010, 8:20:01 AM10/21/10

to

Ok, so while trying to write a changelog on this patch I got myself
terribly confused again..

calc_load_active_fold() is a relative operation and simply gives delta
values since the last time it got called. That means that the sum of
multiple invocations in a given time interval should be identical to a
single invocation.

Therefore, the going idle multiple times during LOAD_FREQ hypothesis
doesn't really make sense.

Even if it became idle but wasn't idle at the LOAD_FREQ turn-over it
shouldn't matter, since the calc_load_account_active() call will simply
fold the remaining delta with the accrued idle delta and the total
should all match up once we fold into the global calc_load_tasks.

So afaict its should all have worked and this patch is a big NOP,.
except it isn't..

Damn I hate this bug.. ;-) Anybody?

Venkatesh Pallipadi

unread,

Oct 21, 2010, 1:20:01 PM10/21/10

to

Yes. Thats what I was thinking trying to understand this code yesterday.

Also with sequence number I don't think nr_interruptible would be
handled correctly
as tasks can move to CPU after it first went idle and may not get
accounted later.

I somehow feel the problem is with nr_interruptible, which gets
accounted multiple
times on idle tasks and only once per LOAD_FREQ on busy tasks.
However, things are
not fully clear to me yet. Have to look at the code a bit more.

Thanks,
Venki

tmhi...@gmail.com

unread,

Oct 21, 2010, 2:40:02 PM10/21/10

to

On Wed, Oct 20, 2010 at 09:48:43PM -0400, tm@ wrote:

Now that I've actually had a chance to boot the kernel with the patch
applied I'm sorry to say but the load average isn't decaying as fast as it
ought to, at the very least. My machine's been idle for the last ten minutes
but the one minute average is still at 0.89 and shooting up to 1.5, the 5
min average is 0.9, and the 15 min average is .68 and climbing. Even as I'm
writing this the averages are continuing to drop, but *very* slowly.
Glacially, almost. The one minute average is continuing to randomly spike
high for no reason I can tell as well.

I'll let you guys know if this actually bottoms out at some point.

Tim McGrath

tmhi...@gmail.com

unread,

Oct 21, 2010, 9:40:02 PM10/21/10

to

It did not. When I came home and checked, my load average was a
steady 0.7-0.8 across the board on all averages with the machine idle since
six hours ago. I guess the patch didn't fix the problem for me. If you want,
I'll try building master/tip with the patch applied, but I doubt it'll
really be different.

On the plus side, the patch did do something - it seems much less
erratic than it used to be for whatever reason, and now just has a very
steady load average rather than jumping about as it does without the patch
applied.

I wish I understood the code enough to know what is going wrong
here. I have to wonder what impact the original bug was causing. It seems to
me that if it only affected a few people it might be worth backing out the
patch and rethinking the problem it was meant to fix. On the other hand, if
there's some way of diagnosing the problem I'm all for it - is there some
kprintfs or something I could put in the code to find out when it's doing
unlikely or 'impossible' things? There is serious weirdness going on here
and I'd like to figure out the cause of it. I get the impression we're all
bumbling about in the dark poking at a gigantic elephant and getting the
wrong impressions.

Tim McGrath

Venkatesh Pallipadi

unread,

Oct 22, 2010, 5:10:02 PM10/22/10

to

I started making small changes to the code, but none of the change helped much.
I think the problem with the current code is that, even though idle CPUs
update load, the fold only happens when one of the CPU is busy
and we end up taking its load into global load.

So, I tried to simplify things and doing the updates directly from idle loop.
This is only a test patch, and eventually we need to hook it off somewhere
else, instead of idle loop and also this is expected work only as x86_64
right now.

Peter: Do you think something like this will work? loadavg went
quite on two of my test systems after this change (4 cpu and 24 cpu).

Thanks,
Venki

---
arch/x86/kernel/process_64.c | 2 +
kernel/sched.c | 67 +++++++++++------------------------------
kernel/sched_idletask.c | 1 -
3 files changed, 20 insertions(+), 50 deletions(-)

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 3d9ea53..aaa8025 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -101,6 +101,7 @@ static inline void play_dead(void)
}
#endif

+void idle_load_update(void);
/*
* The idle thread. There's no useful work to be
* done, so just try to conserve power and have a
@@ -140,6 +141,7 @@ void cpu_idle(void)
stop_critical_timings();
pm_idle();
start_critical_timings();
+ idle_load_update();

trace_power_end(smp_processor_id());

diff --git a/kernel/sched.c b/kernel/sched.c
index dc85ceb..6d589c1 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1819,7 +1819,6 @@ static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)

}
#endif

-static void calc_load_account_idle(struct rq *this_rq);
static void update_sysctl(void);
static int get_update_sysctl_factor(void);
static void update_cpu_load(struct rq *this_rq);

@@ -2959,11 +2958,12 @@ static unsigned long calc_load_update;
unsigned long avenrun[3];
EXPORT_SYMBOL(avenrun);

-static long calc_load_fold_active(struct rq *this_rq)
+static long calc_load_fold(struct rq *this_rq, int idle)
{
- long nr_active, delta = 0;
+ long nr_active = 0, delta = 0;

- nr_active = this_rq->nr_running;
+ if (!idle)
+ nr_active = this_rq->nr_running;
nr_active += (long) this_rq->nr_uninterruptible;

if (nr_active != this_rq->calc_load_active) {
@@ -2974,46 +2974,6 @@ static long calc_load_fold_active(struct rq *this_rq)
return delta;
}

-#ifdef CONFIG_NO_HZ
-/*
- * For NO_HZ we delay the active fold to the next LOAD_FREQ update.
- *
- * When making the ILB scale, we should try to pull this in as well.
- */
-static atomic_long_t calc_load_tasks_idle;
-

-static void calc_load_account_idle(struct rq *this_rq)

-{
- long delta;
-

- delta = calc_load_fold_active(this_rq);
- if (delta)
- atomic_long_add(delta, &calc_load_tasks_idle);

-}
-
-static long calc_load_fold_idle(void)
-{
- long delta = 0;
-
- /*
- * Its got a race, we don't care...
- */
- if (atomic_long_read(&calc_load_tasks_idle))
- delta = atomic_long_xchg(&calc_load_tasks_idle, 0);
-
- return delta;
-}
-#else

-static void calc_load_account_idle(struct rq *this_rq)

-{
-}
-
-static inline long calc_load_fold_idle(void)
-{
- return 0;
-}
-#endif
-
/**
* get_avenrun - get the load average array
* @loads: pointer to dest load array
@@ -3043,7 +3003,7 @@ calc_load(unsigned long load, unsigned long exp, unsigned long active)
*/
void calc_global_load(void)
{
- unsigned long upd = calc_load_update + 10;
+ unsigned long upd = calc_load_update + LOAD_FREQ/2;
long active;

if (time_before(jiffies, upd))
@@ -3063,21 +3023,30 @@ void calc_global_load(void)
* Called from update_cpu_load() to periodically update this CPU's
* active count.
*/
-static void calc_load_account_active(struct rq *this_rq)
+static void calc_load_account(struct rq *this_rq, int idle)
{
long delta;

if (time_before(jiffies, this_rq->calc_load_update))
return;

- delta = calc_load_fold_active(this_rq);
- delta += calc_load_fold_idle();
+ delta = calc_load_fold(this_rq, idle);
if (delta)
atomic_long_add(delta, &calc_load_tasks);

this_rq->calc_load_update += LOAD_FREQ;
}

+void idle_load_update(void)
+{
+ struct rq *rq = this_rq();
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&rq->lock, flags);
+ calc_load_account(rq, 1);
+ raw_spin_unlock_irqrestore(&rq->lock, flags);
+}
+
/*
* The exact cpuload at various idx values, calculated at every tick would be
* load = (2^idx - 1) / 2^idx * load + 1 / 2^idx * cur_load
@@ -3194,7 +3163,7 @@ static void update_cpu_load_active(struct rq *this_rq)
{
update_cpu_load(this_rq);

- calc_load_account_active(this_rq);
+ calc_load_account(this_rq, 0);
}

#ifdef CONFIG_SMP
diff --git a/kernel/sched_idletask.c b/kernel/sched_idletask.c
index 9fa0f40..6ca191f 100644

--- a/kernel/sched_idletask.c
+++ b/kernel/sched_idletask.c
@@ -23,7 +23,6 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
static struct task_struct *pick_next_task_idle(struct rq *rq)
{
schedstat_inc(rq, sched_goidle);
- calc_load_account_idle(rq);
return rq->idle;
}

--
1.7.1

Venkatesh Pallipadi

unread,

Oct 22, 2010, 7:10:01 PM10/22/10

to

(Sorry about the subjectless earlier mail)

Thanks,
Venki

-static void calc_load_account_idle(struct rq *this_rq)

-{
- long delta;
-
- delta = calc_load_fold_active(this_rq);
- if (delta)
- atomic_long_add(delta, &calc_load_tasks_idle);
-}
-
-static long calc_load_fold_idle(void)
-{
- long delta = 0;
-
- /*
- * Its got a race, we don't care...
- */
- if (atomic_long_read(&calc_load_tasks_idle))
- delta = atomic_long_xchg(&calc_load_tasks_idle, 0);
-
- return delta;
-}
-#else

-static void calc_load_account_idle(struct rq *this_rq)

--

tmhi...@gmail.com

unread,

Oct 22, 2010, 10:20:02 PM10/22/10

to

On Fri, Oct 22, 2010 at 04:03:42PM -0700, Venkatesh Pallipadi wrote:
> (Sorry about the subjectless earlier mail)
>
> I started making small changes to the code, but none of the change helped much.
> I think the problem with the current code is that, even though idle CPUs
> update load, the fold only happens when one of the CPU is busy
> and we end up taking its load into global load.
>
> So, I tried to simplify things and doing the updates directly from idle loop.
> This is only a test patch, and eventually we need to hook it off somewhere
> else, instead of idle loop and also this is expected work only as x86_64
> right now.
>
> Peter: Do you think something like this will work? loadavg went
> quite on two of my test systems after this change (4 cpu and 24 cpu).
>
> Thanks,
> Venki

I'd really like to be able to help test this, but with a 32bit x86
machine I guess this patch won't do anything for me. How hard would it be to
mangle this into working for my machine or should I just wait?

Tim McGrath

Message has been deleted

Venkatesh Pallipadi

unread,

Oct 25, 2010, 12:30:02 PM10/25/10

to

On Mon, Oct 25, 2010 at 3:12 AM, Peter Zijlstra <pet...@infradead.org> wrote:

> On Fri, 2010-10-22 at 16:03 -0700, Venkatesh Pallipadi wrote:
>> I started making small changes to the code, but none of the change helped much.
>> I think the problem with the current code is that, even though idle CPUs
>> update load, the fold only happens when one of the CPU is busy
>> and we end up taking its load into global load.
>>
>> So, I tried to simplify things and doing the updates directly from idle loop.
>> This is only a test patch, and eventually we need to hook it off somewhere
>> else, instead of idle loop and also this is expected work only as x86_64
>> right now.
>>
>> Peter: Do you think something like this will work? loadavg went
>> quite on two of my test systems after this change (4 cpu and 24 cpu).
>

> Not really, CPUs can stay idle for _very_ long times (!x86 cpus that
> don't have crappy timers like HPET which roll around every 2-4 seconds).
>
> But all CPUs staying idle for a long time is exactly the scenario you
> fix before using the decay_load_misses() stuff, except that is for the
> load-balancer per-cpu load numbers not the global cpu load avg. Won't a
> similar approach work here?
>

Yes. Thought about that. One problem there is that works with nohz_idle_balance,
which will not be called if all the CPUs are idle for example.
As this is once in 5 secs, probably doing nr_running() and
nr_uninterruptible() should
be OK even on huge systems. But, that was the original code here, except that
it was inside xtime_lock.

Thanks,
Venki

Peter Zijlstra

unread,

Oct 26, 2010, 8:50:02 AM10/26/10

to

On Mon, 2010-10-25 at 12:12 +0200, Peter Zijlstra wrote:
> On Fri, 2010-10-22 at 16:03 -0700, Venkatesh Pallipadi wrote:

> > I started making small changes to the code, but none of the change helped much.
> > I think the problem with the current code is that, even though idle CPUs
> > update load, the fold only happens when one of the CPU is busy
> > and we end up taking its load into global load.
> >
> > So, I tried to simplify things and doing the updates directly from idle loop.
> > This is only a test patch, and eventually we need to hook it off somewhere
> > else, instead of idle loop and also this is expected work only as x86_64
> > right now.
> >
> > Peter: Do you think something like this will work? loadavg went
> > quite on two of my test systems after this change (4 cpu and 24 cpu).
>

> Not really, CPUs can stay idle for _very_ long times (!x86 cpus that
> don't have crappy timers like HPET which roll around every 2-4 seconds).
>
> But all CPUs staying idle for a long time is exactly the scenario you
> fix before using the decay_load_misses() stuff, except that is for the
> load-balancer per-cpu load numbers not the global cpu load avg. Won't a
> similar approach work here?

The crude patch would be something like the below.. a smarter patch will
try and avoid that loop.

---
include/linux/sched.h | 2 +-
kernel/sched.c | 20 +++++++++-----------
kernel/timer.c | 2 +-
3 files changed, 11 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7a6e81f..84c1bf1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -143,7 +143,7 @@ extern unsigned long nr_iowait_cpu(int cpu);
extern unsigned long this_cpu_load(void);

-extern void calc_global_load(void);
+extern void calc_global_load(int ticks);

extern unsigned long get_parent_ip(unsigned long addr);

diff --git a/kernel/sched.c b/kernel/sched.c
index 41f1869..49a2baf 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3171,22 +3171,20 @@ calc_load(unsigned long load, unsigned long exp, unsigned long active)
* calc_load - update the avenrun load estimates 10 ticks after the
* CPUs have updated calc_load_tasks.
*/
-void calc_global_load(void)
+void calc_global_load(int ticks)

{
- unsigned long upd = calc_load_update + 10;

long active;

- if (time_before(jiffies, upd))
- return;
-
- active = atomic_long_read(&calc_load_tasks);
- active = active > 0 ? active * FIXED_1 : 0;
+ while (!time_before(jiffies, calc_load_update + 10)) {
+ active = atomic_long_read(&calc_load_tasks);
+ active = active > 0 ? active * FIXED_1 : 0;

- avenrun[0] = calc_load(avenrun[0], EXP_1, active);
- avenrun[1] = calc_load(avenrun[1], EXP_5, active);
- avenrun[2] = calc_load(avenrun[2], EXP_15, active);
+ avenrun[0] = calc_load(avenrun[0], EXP_1, active);
+ avenrun[1] = calc_load(avenrun[1], EXP_5, active);
+ avenrun[2] = calc_load(avenrun[2], EXP_15, active);

- calc_load_update += LOAD_FREQ;
+ calc_load_update += LOAD_FREQ;
+ }
}

/*
diff --git a/kernel/timer.c b/kernel/timer.c
index d6ccb90..9f82b2a 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1297,7 +1297,7 @@ void do_timer(unsigned long ticks)
{
jiffies_64 += ticks;
update_wall_time();
- calc_global_load();
+ calc_global_load(ticks);
}

#ifdef __ARCH_WANT_SYS_ALARM

Peter Zijlstra

unread,

Oct 26, 2010, 10:10:02 AM10/26/10

to

On Tue, 2010-10-26 at 14:44 +0200, Peter Zijlstra wrote:
> a smarter patch will try and avoid that loop.

a1 = a0 * e + a * (1 - e)

a2 = a1 * e + a * (1 - e)
= (a0 * e + a * (1 - e)) * e + a * (1 - e)
= a0 * e^2 + a * (1 - e) * (1 + e)

a3 = a2 * e + a * (1 - e)
= (a0 * e^2 + a * (1 - e) * (1 + e)) * e + a * (1 - e)
= a0 * e^3 + a * (1 - e) * (1 + e + e^2)

an = a0 * e^n + a * (1 - e) * (1 + e + ... + e^n-1)
= a0 * e^n + a * (1 - e) * (1 - e^n)/(1 - e)
= a0 * e^n + a * (1 - e^n)

the trouble seems to be that that is a rather slow function, stuffing
that in a table will either give us very large tables or very coarse
decay.

n: an an * 2048

1: 0.99446 2037
2: 0.98895 2025
4: 0.978023 2003
8: 0.956529 1959
16: 0.914947 1874
32: 0.837128 1714
64: 0.700784 1435
128: 0.491098 1006
256: 0.241177 494
512: 0.0581666 119
1024: 0.00338335 7
2048: 1.14471e-05 0

Peter Zijlstra

unread,

Oct 29, 2010, 3:50:02 PM10/29/10

to

On Tue, 2010-10-26 at 16:05 +0200, Peter Zijlstra wrote:
>
> a1 = a0 * e + a * (1 - e)
>
> a2 = a1 * e + a * (1 - e)
> = (a0 * e + a * (1 - e)) * e + a * (1 - e)
> = a0 * e^2 + a * (1 - e) * (1 + e)
>
> a3 = a2 * e + a * (1 - e)
> = (a0 * e^2 + a * (1 - e) * (1 + e)) * e + a * (1 - e)
> = a0 * e^3 + a * (1 - e) * (1 + e + e^2)
>
> an = a0 * e^n + a * (1 - e) * (1 + e + ... + e^n-1)
> = a0 * e^n + a * (1 - e) * (1 - e^n)/(1 - e)
> = a0 * e^n + a * (1 - e^n)

not tested other than compile, but how does that look...?

---
include/linux/sched.h | 2 +-
kernel/sched.c | 74 +++++++++++++++++++++++++++++++++++++++++++++---
kernel/timer.c | 2 +-
3 files changed, 71 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b3d07df..f878db3 100644

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -143,7 +143,7 @@ extern unsigned long nr_iowait_cpu(int cpu);
extern unsigned long this_cpu_load(void);

-extern void calc_global_load(void);
+extern void calc_global_load(int ticks);

extern unsigned long get_parent_ip(unsigned long addr);

diff --git a/kernel/sched.c b/kernel/sched.c

index 41f1869..125417e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3168,10 +3168,64 @@ calc_load(unsigned long load, unsigned long exp, unsigned long active)
}

/*
+ * fixed_power_int - compute: x^n in O(log n) time
+ * @x: base of the power
+ * frac_bits: fractional bits of @x
+ * @n: power to raise @x to.
+ */
+static unsigned long
+fixed_power_int(unsigned long x, unsigned int frac_bits, unsigned int n)
+{
+ unsigned long result = 1UL << frac_bits;
+
+ while (n) {
+ if (n & 1) {
+ result *= x;
+ result += 1UL << (frac_bits - 1);
+ result >>= frac_bits;
+ }
+ n >>= 1;
+ x *= x;
+ x >>= frac_bits;
+ }
+
+ return result;
+}
+
+/*
+ * a1 = a0 * e + a * (1 - e)
+ *
+ * a2 = a1 * e + a * (1 - e)
+ * = (a0 * e + a * (1 - e)) * e + a * (1 - e)
+ * = a0 * e^2 + a * (1 - e) * (1 + e)
+ *
+ * a3 = a2 * e + a * (1 - e)
+ * = (a0 * e^2 + a * (1 - e) * (1 + e)) * e + a * (1 - e)
+ * = a0 * e^3 + a * (1 - e) * (1 + e + e^2)
+ *
+ * an = a0 * e^n + a * (1 - e) * (1 + e + ... + e^n-1)
+ * = a0 * e^n + a * (1 - e) * (1 - e^n)/(1 - e)
+ * = a0 * e^n + a * (1 - e^n)
+ */
+static unsigned long
+calc_load_n(unsigned long load, unsigned long exp,
+ unsigned long active, unsigned int n)
+{
+ unsigned long en = fixed_power_int(exp, FSHIFT, n);
+
+ load *= en;
+ load += active * (FIXED_1 - en);
+ load += 1UL << (FSHIFT - 1);
+ load >>= FSHIFT;
+
+ return load;
+}
+
+/*

* calc_load - update the avenrun load estimates 10 ticks after the
* CPUs have updated calc_load_tasks.
*/
-void calc_global_load(void)
+void calc_global_load(int ticks)
{

unsigned long upd = calc_load_update + 10;
long active;

@@ -3182,11 +3236,21 @@ void calc_global_load(void)
active = atomic_long_read(&calc_load_tasks);

active = active > 0 ? active * FIXED_1 : 0;

- avenrun[0] = calc_load(avenrun[0], EXP_1, active);
- avenrun[1] = calc_load(avenrun[1], EXP_5, active);
- avenrun[2] = calc_load(avenrun[2], EXP_15, active);

+ if (ticks > LOAD_FREQ) {
+ int n = (jiffies - upd) / LOAD_FREQ;
+
+ avenrun[0] = calc_load_n(avenrun[0], EXP_1, active, n);
+ avenrun[1] = calc_load_n(avenrun[1], EXP_5, active, n);
+ avenrun[2] = calc_load_n(avenrun[2], EXP_15, active, n);

- calc_load_update += LOAD_FREQ;
+ calc_load_update += n * LOAD_FREQ;
+ } else {

+ avenrun[0] = calc_load(avenrun[0], EXP_1, active);
+ avenrun[1] = calc_load(avenrun[1], EXP_5, active);
+ avenrun[2] = calc_load(avenrun[2], EXP_15, active);

+

+ calc_load_update += LOAD_FREQ;
+ }
}

/*
diff --git a/kernel/timer.c b/kernel/timer.c
index d6ccb90..9f82b2a 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1297,7 +1297,7 @@ void do_timer(unsigned long ticks)
{
jiffies_64 += ticks;
update_wall_time();
- calc_global_load();
+ calc_global_load(ticks);
}

#ifdef __ARCH_WANT_SYS_ALARM

--

Kyle McMartin

unread,

Nov 9, 2010, 2:00:02 PM11/9/10

to

On Fri, Oct 29, 2010 at 09:42:23PM +0200, Peter Zijlstra wrote:
>
> not tested other than compile, but how does that look...?
>

Hi Peter,

I finally got around to building a test kernel for a user who had been
reporting this against F-14 (2.6.35) but they report there's not much
improvement.

Sorry. :/

<quote>
Date: Tue, 09 Nov 2010 12:46:53 -0600
From: Michael Cronenworth <mi...@cchtml.com>
To: te...@lists.fedoraproject.org
Subject: Re: Virtually impossible to keep load under 0.5 on idle desktop with
F14+kms

On 11/09/2010 11:55 AM, Kyle McMartin wrote:
> http://kyle.fedorapeople.org/kernel/2.6.35.6-52.sched1.fc14/

Running this kernel is not any better. After leaving my system idle for
10 minutes I finally saw it dip below 0.10, but after leaving the
computer, and coming back 10 minutes later, I see a load of 0.25. No
programs running.

Before I hit "send" on this message, I ran uptime again and got this:
$ uptime
12:45:24 up 32 min, 3 users, load average: 0.82, 0.43, 0.34

Thunderbird was the only program running. 99.8% idle.
</quote>

regards, Kyle

Peter Zijlstra

unread,

Nov 9, 2010, 2:10:02 PM11/9/10

to

On Tue, 2010-11-09 at 13:55 -0500, Kyle McMartin wrote:
> On Fri, Oct 29, 2010 at 09:42:23PM +0200, Peter Zijlstra wrote:
> >
> > not tested other than compile, but how does that look...?
> >
>
> Hi Peter,
>
> I finally got around to building a test kernel for a user who had been
> reporting this against F-14 (2.6.35) but they report there's not much
> improvement.
>
> Sorry. :/

Right, I tested it myself yesterday and got similar results. Does the
'simple' patch from a few posts back work for people?

http://lkml.org/lkml/2010/10/26/150

If that does work I fudged the fancy math, if that doesn't work either
we're back to square zero.

tmhi...@gmail.com

unread,

Nov 9, 2010, 9:40:02 PM11/9/10

to

On Tue, Nov 09, 2010 at 08:02:28PM +0100, Peter Zijlstra wrote:
> On Tue, 2010-11-09 at 13:55 -0500, Kyle McMartin wrote:
> > On Fri, Oct 29, 2010 at 09:42:23PM +0200, Peter Zijlstra wrote:
> > >
> > > not tested other than compile, but how does that look...?
> > >
> >
> > Hi Peter,
> >
> > I finally got around to building a test kernel for a user who had been
> > reporting this against F-14 (2.6.35) but they report there's not much
> > improvement.
> >
> > Sorry. :/
>
> Right, I tested it myself yesterday and got similar results. Does the
> 'simple' patch from a few posts back work for people?
>
> http://lkml.org/lkml/2010/10/26/150
>
> If that does work I fudged the fancy math, if that doesn't work either
> we're back to square zero.

I seem to remember someone saying this would only work on x86-64; if this is
meant to work with a 32bit intel single core processor, please let me know -
I'd like to try it. It would also help if I knew which kernel version this
simpler patch is meant to apply to.

Thank you,
Tim McGrath

Kyle McMartin

unread,

Nov 9, 2010, 10:50:01 PM11/9/10

to

On Tue, Nov 09, 2010 at 08:02:28PM +0100, Peter Zijlstra wrote:

> Right, I tested it myself yesterday and got similar results. Does the
> 'simple' patch from a few posts back work for people?
>

Apparently it doesn't make a large amount of difference. :/

regards, Kyle

Peter Zijlstra

unread,

Nov 10, 2010, 7:10:02 AM11/10/10

to

Venki's patch was, the one I linked should work on every arch.

Peter Zijlstra

unread,

Nov 10, 2010, 7:10:02 AM11/10/10

to

On Tue, 2010-11-09 at 22:45 -0500, Kyle McMartin wrote:
> On Tue, Nov 09, 2010 at 08:02:28PM +0100, Peter Zijlstra wrote:
> > Right, I tested it myself yesterday and got similar results. Does the
> > 'simple' patch from a few posts back work for people?
> >
>
> Apparently it doesn't make a large amount of difference. :/

Crud, ok, back to basics it is.. thanks for testing!

tmhi...@gmail.com

unread,

Nov 14, 2010, 12:20:01 AM11/14/10

to

On Wed, Nov 10, 2010 at 01:00:24PM +0100, Peter Zijlstra wrote:
> On Tue, 2010-11-09 at 22:45 -0500, Kyle McMartin wrote:
> > On Tue, Nov 09, 2010 at 08:02:28PM +0100, Peter Zijlstra wrote:
> > > Right, I tested it myself yesterday and got similar results. Does the
> > > 'simple' patch from a few posts back work for people?
> > >
> >
> > Apparently it doesn't make a large amount of difference. :/
>
> Crud, ok, back to basics it is.. thanks for testing!

Much like the other tester, this doesn't make any difference at all for me,
load fluctuates semi randomly as it did before with an average after 20 mins
of around 0.8 with the machine idle. I have to wonder, is there currently a
reason NOT to revert this commit? Does this commit that's causing problems
for myself and other people actually fix problems for some people that
existed before? I just want to know why there seems to be a struggle to fix
the current implementation rather than backing out the change that caused
the problem I'm dealing with and then trying to work from there.

If you've forgotten, I'm talking specifically about this commit:

[74f5187ac873042f502227701ed1727e7c5fbfa9] sched: Cure load average vs NO_HZ woes

We've gone through two kernel releases with this commit, and I think I'm
understandably curious why it's not been reverted if it's causing problems
with no solution in sight.

Note that I'm currently running 2.6.35.8 with CONFIG_NO_HZ unset with no
problems with the load average, so at least that works properly, and proves
that I'm not being affected by some other quirk at least...

Machine's only been up ten minutes or so, but here's the uptime data anyway
for the curious:

00:10:02 up 12 min, 4 users, load average: 0.00, 0.11, 0.14

Tim McGrath

Damien Wyart

unread,

Nov 25, 2010, 8:50:02 AM11/25/10

to

Hi Peter,

Do you have a plan about solving this problem in the coming weeks or
will it still be present in 2.6.37 final? I am not implying any critic
in any way, just want to know a status because compared to other bugs
discussed in lkml, this one has seen very little technical proposals
towards a resolution (except your attempts). People who signed off the
discussed commit did not make any technical analysis about the bug
reported and this gives a strange feeling. Earth will not stop rotating
if it is not solved, but I still this it is quite important and affects
many users.

> [74f5187ac873042f502227701ed1727e7c5fbfa9] sched: Cure load average vs> NO_HZ woes

> We've gone through two kernel releases with this commit, and I think
> I'm understandably curious why it's not been reverted if it's causing
> problems with no solution in sight.

Even with this commit reverted, the load is incorrect in NO_HZ mode, so
revert only is not enough.

--
Damien Wyart

Peter Zijlstra

unread,

Nov 25, 2010, 9:10:02 AM11/25/10

to

On Thu, 2010-11-25 at 14:31 +0100, Damien Wyart wrote:
> Hi Peter,
>
> Do you have a plan about solving this problem in the coming weeks or
> will it still be present in 2.6.37 final? I am not implying any critic
> in any way, just want to know a status because compared to other bugs
> discussed in lkml, this one has seen very little technical proposals
> towards a resolution (except your attempts). People who signed off the
> discussed commit did not make any technical analysis about the bug
> reported and this gives a strange feeling. Earth will not stop rotating
> if it is not solved, but I still this it is quite important and affects
> many users.

Yes I agree, I've put the problem on the back burner for a bit so that I
can forget all my tinkering and try again with a fresher mind :-)

I'll try and get around to it somewhere this week again.

Peter Zijlstra

unread,

Nov 27, 2010, 3:20:01 PM11/27/10

to

On Thu, 2010-11-25 at 15:03 +0100, Peter Zijlstra wrote:

> I'll try and get around to it somewhere this week again.

How does this work for you? Its hideous but lets start simple.

---
kernel/sched.c | 5 +++++
kernel/timer.c | 14 ++++++++++++--
2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index dc91a4d..2ebb07c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3223,6 +3223,11 @@ static void calc_load_account_active(struct rq *this_rq)
this_rq->calc_load_update += LOAD_FREQ;
}

+void calc_load_account_this(void)
+{
+ calc_load_account_active(this_rq());

+}
+
/*
* The exact cpuload at various idx values, calculated at every tick would be
* load = (2^idx - 1) / 2^idx * load + 1 / 2^idx * cur_load

diff --git a/kernel/timer.c b/kernel/timer.c
index 68a9ae7..af095c0 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1315,11 +1315,21 @@ void run_local_timers(void)
* jiffies is defined in the linker script...
*/

+extern void calc_load_account_this(void);
+
void do_timer(unsigned long ticks)
{
- jiffies_64 += ticks;
+ if (ticks > 1) {
+ while (ticks--) {
+ jiffies_64++;
+ calc_load_account_this();
+ calc_global_load();
+ }
+ } else {
+ jiffies_64 += ticks;
+ calc_global_load();
+ }
update_wall_time();
- calc_global_load();
}

#ifdef __ARCH_WANT_SYS_ALARM

Kyle McMartin

unread,

Nov 27, 2010, 11:30:01 PM11/27/10

to

On Sat, Nov 27, 2010 at 09:15:20PM +0100, Peter Zijlstra wrote:
> > I'll try and get around to it somewhere this week again.
>
> How does this work for you? Its hideous but lets start simple.
>

At least one user reports noticeably lower idle load average using a
kernel with that patch.

Thanks, looks promising!

Kyle

Damien Wyart

unread,

Nov 28, 2010, 6:50:01 AM11/28/10

to

Hi,

* Peter Zijlstra <pet...@infradead.org> [2010-11-27 21:15]:

> How does this work for you? Its hideous but lets start simple.

> [...]

Doesn't give wrong numbers like initial bug and tentative patches, but
feels a bit too slow when numbers go up and down. Correct values are
reached when waiting long enough, but it feels slow.

As I've tested many combinations, maybe this is an impression because
I do not remember about "normal" delays for the load to rise and fall,
but this still feels slow.

Thanks for working on this problem again.

--
Damien Wyart

Valdis.K...@vt.edu

unread,

Nov 28, 2010, 1:10:02 PM11/28/10

to

On Sun, 28 Nov 2010 12:40:27 +0100, Damien Wyart said:
> Hi,
>
> * Peter Zijlstra <pet...@infradead.org> [2010-11-27 21:15]:
> > How does this work for you? Its hideous but lets start simple.
> > [...]
>
> Doesn't give wrong numbers like initial bug and tentative patches, but
> feels a bit too slow when numbers go up and down. Correct values are
> reached when waiting long enough, but it feels slow.

Umm.. maybe the code still averages the 1/5/15 over that many minute's
worth of ticks, which may be longer for a tickless kernel?

Peter Zijlstra

unread,

Nov 29, 2010, 6:40:01 AM11/29/10

to

On Sun, 2010-11-28 at 12:40 +0100, Damien Wyart wrote:
> Hi,
>
> * Peter Zijlstra <pet...@infradead.org> [2010-11-27 21:15]:
> > How does this work for you? Its hideous but lets start simple.
> > [...]
>
> Doesn't give wrong numbers like initial bug and tentative patches, but
> feels a bit too slow when numbers go up and down. Correct values are
> reached when waiting long enough, but it feels slow.
>
> As I've tested many combinations, maybe this is an impression because
> I do not remember about "normal" delays for the load to rise and fall,
> but this still feels slow.

You can test this by either booting with nohz=off, or builting with
CONFIG_NO_HZ=n and then comparing the result, something like

make O=defconfig clean; while sleep 10; do uptime >> load.log; done &
make -j32 O=defconfig; kill %1

And comparing the curves between the NO_HZ and !NO_HZ kernels.

I'll try and make the patch less hideous ;-)

tmhi...@gmail.com

unread,

Nov 29, 2010, 2:50:01 PM11/29/10

to

On Mon, Nov 29, 2010 at 12:38:46PM +0100, Peter Zijlstra wrote:
> On Sun, 2010-11-28 at 12:40 +0100, Damien Wyart wrote:
> > Hi,
> >
> > * Peter Zijlstra <pet...@infradead.org> [2010-11-27 21:15]:
> > > How does this work for you? Its hideous but lets start simple.
> > > [...]
> >
> > Doesn't give wrong numbers like initial bug and tentative patches, but
> > feels a bit too slow when numbers go up and down. Correct values are
> > reached when waiting long enough, but it feels slow.
> >
> > As I've tested many combinations, maybe this is an impression because
> > I do not remember about "normal" delays for the load to rise and fall,
> > but this still feels slow.
>
> You can test this by either booting with nohz=off, or builting with
> CONFIG_NO_HZ=n and then comparing the result, something like
>
> make O=defconfig clean; while sleep 10; do uptime >> load.log; done &
> make -j32 O=defconfig; kill %1
>
> And comparing the curves between the NO_HZ and !NO_HZ kernels.
>
> I'll try and make the patch less hideous ;-)

I've tested this patch on my own use case, and it seems to work for the most
part - it's still not settling as low as the previous implementation used
to, nor is it settling as low as CONFIG_NO_HZ=N (that is to say, 0.00 across
the board when not being used) however, this is definitely an improvement:

14:26:04 up 9:08, 5 users, load average: 0.05, 0.01, 0.00

This is the result of running uptime on a checked out version of

[74f5187ac873042f502227701ed1727e7c5fbfa9] sched: Cure load average vs NO_HZ woes

with the patch applied, starting X, and simply letting the machine sit idle
for nine hours. For the brief period I spent watching it after boot, it
quickly began settling down to a reasonable value, I only let it sit idle
this long to verify the loadavg was consistently low. (the loadavg was
consistently erratic, anywhere from 0.6 to 1.2 with the machine idle without
this patch)

Thank you for the hard work,
Tim McGrath

Peter Zijlstra

unread,

Nov 29, 2010, 6:10:02 PM11/29/10

to

Ok, that's good testing.. so its still not quite the same as NO_HZ=n,
how about this one?

(it seems to drop down to 0.00 if I wait a few minutes with top -d5)

---
kernel/sched.c | 5 +++++
kernel/time/tick-sched.c | 4 +++-
kernel/timer.c | 12 ++++++++++++
3 files changed, 20 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 864040c..a859158 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3082,6 +3082,11 @@ static void calc_load_account_active(struct rq *this_rq)

this_rq->calc_load_update += LOAD_FREQ;
}

+void calc_load_account_this(void)
+{
+ calc_load_account_active(this_rq());
+}
+
/*
* The exact cpuload at various idx values, calculated at every tick would be
* load = (2^idx - 1) / 2^idx * load + 1 / 2^idx * cur_load

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 3e216e0..1e6d384 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -41,6 +41,8 @@ struct tick_sched *tick_get_tick_sched(int cpu)
return &per_cpu(tick_cpu_sched, cpu);
}

+extern void do_timer_nohz(unsigned long ticks);
+
/*
* Must be called with interrupts disabled !
*/
@@ -75,7 +77,7 @@ static void tick_do_update_jiffies64(ktime_t now)
last_jiffies_update = ktime_add_ns(last_jiffies_update,
incr * ticks);
}
- do_timer(++ticks);
+ do_timer_nohz(++ticks);

/* Keep the tick_next_period variable up to date */
tick_next_period = ktime_add(last_jiffies_update, tick_period);
diff --git a/kernel/timer.c b/kernel/timer.c
index d6ccb90..eb2646f 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1300,6 +1300,18 @@ void do_timer(unsigned long ticks)
calc_global_load();
}

+extern void calc_load_account_this(void);
+
+void do_timer_nohz(unsigned long ticks)
+{

+ while (ticks--) {
+ jiffies_64++;
+ calc_load_account_this();
+ calc_global_load();
+ }

+ update_wall_time();
+}
+
#ifdef __ARCH_WANT_SYS_ALARM

/*

Peter Zijlstra

unread,

Nov 30, 2010, 10:00:02 AM11/30/10

to

On Tue, 2010-11-30 at 00:01 +0100, Peter Zijlstra wrote:
>
> Ok, that's good testing.. so its still not quite the same as NO_HZ=n,
> how about this one?
>
> (it seems to drop down to 0.00 if I wait a few minutes with top -d5)

OK, so here's a less crufty patch that gets the same result on my
machine, load drops down to 0.00 after a while.

It seems a bit slower to reach 0.00, but that could be because I
actually changed the load computation for NO_HZ=n as well, I added a
rounding factor in calc_load(), we no longer truncate the division.

If people want to compare, simply remove the third line from
calc_load(): load += 1UL << (FSHIFT - 1), to restore the old behaviour.

---
include/linux/sched.h | 2 +-
kernel/sched.c | 127 ++++++++++++++++++++++++++++++++++++++++++++----
kernel/timer.c | 2 +-
3 files changed, 118 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index de4d68f..a6e0c62 100644

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -143,7 +143,7 @@ extern unsigned long nr_iowait_cpu(int cpu);
extern unsigned long this_cpu_load(void);

-extern void calc_global_load(void);

+extern void calc_global_load(unsigned long ticks);

extern unsigned long get_parent_ip(unsigned long addr);

diff --git a/kernel/sched.c b/kernel/sched.c
index 864040c..7868a18 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2978,6 +2978,15 @@ static long calc_load_fold_active(struct rq *this_rq)
return delta;
}

+static unsigned long
+calc_load(unsigned long load, unsigned long exp, unsigned long active)
+{
+ load *= exp;
+ load += active * (FIXED_1 - exp);

+ load += 1UL << (FSHIFT - 1);

+ return load >> FSHIFT;
+}
+
#ifdef CONFIG_NO_HZ
/*

* For NO_HZ we delay the active fold to the next LOAD_FREQ update.

@@ -3007,6 +3016,105 @@ static long calc_load_fold_idle(void)

return delta;
}
+
+/**
+ * fixed_power_int - compute: x^n, in O(log n) time

+ *
+ * @x: base of the power

+ * @frac_bits: fractional bits of @x

+ * @n: power to raise @x to.
+ *

+ * By exploiting the relation between the definition of the natural power
+ * function: x^n := x*x*...*x (x multiplied by itself for n times), and
+ * the binary encoding of numbers used by computers: n := \Sum n_i * 2^i,
+ * (where: n_i \elem {0, 1}, the binary vector representing n),
+ * we find: x^n := x^(\Sum n_i * 2^i) := \Prod x^(n_i * 2^i), which is
+ * of course trivially computable in O(log_2 n), the length of our binary
+ * vector.

+ */
+static unsigned long
+fixed_power_int(unsigned long x, unsigned int frac_bits, unsigned int n)
+{
+ unsigned long result = 1UL << frac_bits;
+

+ if (n) for (;;) {

+ if (n & 1) {
+ result *= x;
+ result += 1UL << (frac_bits - 1);
+ result >>= frac_bits;
+ }
+ n >>= 1;

+ if (!n)
+ break;
+ x *= x;
+ x += 1UL << (frac_bits - 1);

+ x >>= frac_bits;
+ }
+
+ return result;
+}
+
+/*
+ * a1 = a0 * e + a * (1 - e)
+ *
+ * a2 = a1 * e + a * (1 - e)
+ * = (a0 * e + a * (1 - e)) * e + a * (1 - e)
+ * = a0 * e^2 + a * (1 - e) * (1 + e)
+ *
+ * a3 = a2 * e + a * (1 - e)
+ * = (a0 * e^2 + a * (1 - e) * (1 + e)) * e + a * (1 - e)
+ * = a0 * e^3 + a * (1 - e) * (1 + e + e^2)
+ *

+ * ...
+ *
+ * an = a0 * e^n + a * (1 - e) * (1 + e + ... + e^n-1) [1]

+ * = a0 * e^n + a * (1 - e) * (1 - e^n)/(1 - e)
+ * = a0 * e^n + a * (1 - e^n)
+ *

+ * [1] application of the geometric series:
+ *
+ * n 1 - x^(n+1)
+ * S_n := \Sum x^i = -------------
+ * i=0 1 - x

+ */
+static unsigned long
+calc_load_n(unsigned long load, unsigned long exp,
+ unsigned long active, unsigned int n)
+{
+

+ return calc_load(load, fixed_power_int(exp, FSHIFT, n), active);
+}
+
+static void calc_global_nohz(unsigned long ticks)
+{
+ long delta, active, n;
+
+ if (time_before(jiffies, calc_load_update))
+ return;
+
+ /*
+ * If we crossed a calc_load_update boundary, make sure to fold
+ * any pending idle changes, the respective CPUs might have
+ * missed the tick driven calc_load_account_active() update
+ * due to NO_HZ
+ */
+ delta = calc_load_fold_idle();
+ if (delta)
+ atomic_long_add(delta, &calc_load_tasks);
+
+ if (ticks >= LOAD_FREQ) {
+ n = ticks / LOAD_FREQ;
+
+ active = atomic_long_read(&calc_load_tasks);
+ active = active > 0 ? active * FIXED_1 : 0;

+
+ avenrun[0] = calc_load_n(avenrun[0], EXP_1, active, n);
+ avenrun[1] = calc_load_n(avenrun[1], EXP_5, active, n);
+ avenrun[2] = calc_load_n(avenrun[2], EXP_15, active, n);

+

+ calc_load_update += n * LOAD_FREQ;
+ }

+}
#else
static void calc_load_account_idle(struct rq *this_rq)
{
@@ -3016,6 +3124,10 @@ static inline long calc_load_fold_idle(void)
{
return 0;
}
+
+static void calc_global_nohz(unsigned long ticks)
+{
+}
#endif

/**
@@ -3033,24 +3145,17 @@ void get_avenrun(unsigned long *loads, unsigned long offset, int shift)
loads[2] = (avenrun[2] + offset) << shift;
}

-static unsigned long
-calc_load(unsigned long load, unsigned long exp, unsigned long active)
-{
- load *= exp;
- load += active * (FIXED_1 - exp);
- return load >> FSHIFT;
-}
-

/*
* calc_load - update the avenrun load estimates 10 ticks after the
* CPUs have updated calc_load_tasks.
*/
-void calc_global_load(void)

+void calc_global_load(unsigned long ticks)
{
- unsigned long upd = calc_load_update + 10;
long active;

- if (time_before(jiffies, upd))
+ calc_global_nohz(ticks);
+
+ if (time_before(jiffies, calc_load_update + 10))
return;

active = atomic_long_read(&calc_load_tasks);
diff --git a/kernel/timer.c b/kernel/timer.c
index d6ccb90..9f82b2a 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c

@@ -1297,7 +1297,7 @@ void do_timer(unsigned long ticks)
{
jiffies_64 += ticks;
update_wall_time();
- calc_global_load();
+ calc_global_load(ticks);
}

#ifdef __ARCH_WANT_SYS_ALARM

--

Kyle McMartin

unread,

Nov 30, 2010, 10:50:01 AM11/30/10

to

On Tue, Nov 30, 2010 at 03:59:05PM +0100, Peter Zijlstra wrote:
> OK, so here's a less crufty patch that gets the same result on my
> machine, load drops down to 0.00 after a while.
>
> It seems a bit slower to reach 0.00, but that could be because I
> actually changed the load computation for NO_HZ=n as well, I added a
> rounding factor in calc_load(), we no longer truncate the division.
>
> If people want to compare, simply remove the third line from
> calc_load(): load += 1UL << (FSHIFT - 1), to restore the old behaviour.
>

Thanks Peter, I'll build a couple kernels for the bugzilla reporters and
see what they turn up.

--Kyle

Damien Wyart

unread,

Nov 30, 2010, 12:00:03 PM11/30/10

to

* Peter Zijlstra <pet...@infradead.org> [101130 15:59]:

> OK, so here's a less crufty patch that gets the same result on my
> machine, load drops down to 0.00 after a while.

Seems OK too here.

> It seems a bit slower to reach 0.00, but that could be because
> I actually changed the load computation for NO_HZ=n as well, I added
> a rounding factor in calc_load(), we no longer truncate the division.

Yes, really feels slower, but in the two configurations as you write.
Then this is a matter of convention, do not know which version is more
"correct" or "natural", but this is coherent in the two modes (HZ and
NO_HZ).

Do you plan to push this one to upstream/stable, or do you plan further
modifications?

Thanks for your patches,
--
Damien Wyart

Damien Wyart

unread,

Nov 30, 2010, 12:10:01 PM11/30/10

to

* Peter Zijlstra <pet...@infradead.org> [101129 12:38]:

> > Doesn't give wrong numbers like initial bug and tentative patches, but
> > feels a bit too slow when numbers go up and down. Correct values are
> > reached when waiting long enough, but it feels slow.

> > As I've tested many combinations, maybe this is an impression because
> > I do not remember about "normal" delays for the load to rise and fall,
> > but this still feels slow.

> You can test this by either booting with nohz=off, or builting with
> CONFIG_NO_HZ=n and then comparing the result, something like

> make O=defconfig clean; while sleep 10; do uptime >> load.log; done &
> make -j32 O=defconfig; kill %1

> And comparing the curves between the NO_HZ and !NO_HZ kernels.

In fact, timings were very similar between the two configurations, so
I guess my feelings were wrong... They were also similar with the next
version of the patch.

--
Damien Wyart

Peter Zijlstra

unread,

Nov 30, 2010, 12:30:01 PM11/30/10

to

On Tue, 2010-11-30 at 17:53 +0100, Damien Wyart wrote:
> * Peter Zijlstra <pet...@infradead.org> [101130 15:59]:
> > OK, so here's a less crufty patch that gets the same result on my
> > machine, load drops down to 0.00 after a while.
>
> Seems OK too here.
>
> > It seems a bit slower to reach 0.00, but that could be because
> > I actually changed the load computation for NO_HZ=n as well, I added
> > a rounding factor in calc_load(), we no longer truncate the division.
>
> Yes, really feels slower, but in the two configurations as you write.
> Then this is a matter of convention, do not know which version is more
> "correct" or "natural", but this is coherent in the two modes (HZ and
> NO_HZ).
>
> Do you plan to push this one to upstream/stable, or do you plan further
> modifications?

No this is about it, I've written a bit more comments, and I'll need to
come up with a changelog, I'll also wait for Kyle's testers to come
back, but yes, then I'll push it upstream/stable.

tmhi...@gmail.com

unread,

Nov 30, 2010, 3:10:01 PM11/30/10

to

I haven't had time to test your further patches but THIS works!

14:57:03 up 14:01, 4 users, load average: 0.00, 0.00, 0.00

Load seems to finally be accurate on my machine compared to processes
running/whatever else usage. This is again testing vs the original commit
that caused the problems for me:

[74f5187ac873042f502227701ed1727e7c5fbfa9] sched: Cure load average vs NO_HZ woes

so I know I'm testing apples to apples here.

As time permits I'll test the later replies you made to yourself.

Thank you,
Tim McGrath

Kyle McMartin

unread,

Nov 30, 2010, 3:10:02 PM11/30/10

to

On Tue, Nov 30, 2010 at 10:39:31AM -0500, Kyle McMartin wrote:
> Thanks Peter, I'll build a couple kernels for the bugzilla reporters and
> see what they turn up.
>

Looks good, one tester says things are promising at least.

Thanks so much for all your effort, Peter.

cheers, Kyle

tmhi...@gmail.com

unread,

Dec 1, 2010, 4:30:03 PM12/1/10

to

On Tue, Nov 30, 2010 at 03:59:05PM +0100, Peter Zijlstra wrote:

> On Tue, 2010-11-30 at 00:01 +0100, Peter Zijlstra wrote:
> >
> > Ok, that's good testing.. so its still not quite the same as NO_HZ=n,
> > how about this one?
> >
> > (it seems to drop down to 0.00 if I wait a few minutes with top -d5)
>
> OK, so here's a less crufty patch that gets the same result on my
> machine, load drops down to 0.00 after a while.
>
> It seems a bit slower to reach 0.00, but that could be because I
> actually changed the load computation for NO_HZ=n as well, I added a
> rounding factor in calc_load(), we no longer truncate the division.
>
> If people want to compare, simply remove the third line from
> calc_load(): load += 1UL << (FSHIFT - 1), to restore the old behaviour.

For some bizzare reason, this version has a small but noticable amount of
jitter and never really seems to hit 0.00 on my machine, tends to jump
around at low values between 0.03 to 0.08 on a routine basis:

16:20:42 up 16:31, 4 users, load average: 0.00, 0.01, 0.05

the jitter seems to have no visible reason for it happening; with no
networking, disk access or a process waking up and demanding attention from
the cpu, it goes back up.

Mind this is obviously NOT as horrible as it was originally, but I'd like to
find out why it's acting so differently.

I'm going to try this variant again with that line you were talking about
disabled and see if it reacts differently. I get the feeling if it's the
rounding factor - since you say that was changed for BOTH nohz=y and n, that
it's not really a problem in the first place, and likely is very low load
that wasn't being accurately reported before.

Tim McGrath

tmhi...@gmail.com

unread,

Dec 2, 2010, 5:20:02 AM12/2/10

to

Indeed, this seems to be the case:

04:50:14 up 5:45, 5 users, load average: 0.00, 0.00, 0.00

the average seems to not be jittery, or at least noticably, and reacts as I
have expected it to in the past with that single line disabled; Since you
have said that this change would affect all load calculations I have not
tested how this patch with the line enabled/disabled reacts with nohz=n,
please let me know if you would like me to test that condition anyway.

Personally since it changes the previous behavior of the load calculation
I'd prefer that the rounding not be done.

Tim McGrath

tip-bot for Peter Zijlstra

unread,

Dec 8, 2010, 3:50:02 PM12/8/10

to

Commit-ID: 0f004f5a696a9434b7214d0d3cbd0525ee77d428
Gitweb: http://git.kernel.org/tip/0f004f5a696a9434b7214d0d3cbd0525ee77d428
Author: Peter Zijlstra <a.p.zi...@chello.nl>
AuthorDate: Tue, 30 Nov 2010 19:48:45 +0100
Committer: Ingo Molnar <mi...@elte.hu>
CommitDate: Wed, 8 Dec 2010 20:15:04 +0100

sched: Cure more NO_HZ load average woes

There's a long-running regression that proved difficult to fix and
which is hitting certain people and is rather annoying in its effects.

Damien reported that after 74f5187ac8 (sched: Cure load average vs
NO_HZ woes) his load average is unnaturally high, he also noted that
even with that patch reverted the load avgerage numbers are not
correct.

The problem is that the previous patch only solved half the NO_HZ
problem, it addressed the part of going into NO_HZ mode, not of
comming out of NO_HZ mode. This patch implements that missing half.

When comming out of NO_HZ mode there are two important things to take
care of:

- Folding the pending idle delta into the global active count.
- Correctly aging the averages for the idle-duration.

So with this patch the NO_HZ interaction should be complete and
behaviour between CONFIG_NO_HZ=[yn] should be equivalent.

Furthermore, this patch slightly changes the load average computation
by adding a rounding term to the fixed point multiplication.

Reported-by: Damien Wyart <damien...@free.fr>
Reported-by: Tim McGrath <tmhi...@gmail.com>
Tested-by: Damien Wyart <damien...@free.fr>
Tested-by: Orion Poplawski <or...@cora.nwra.com>
Tested-by: Kyle McMartin <ky...@mcmartin.ca>
Signed-off-by: Peter Zijlstra <a.p.zi...@chello.nl>
Cc: sta...@kernel.org
Cc: Chase Douglas <chase....@canonical.com>
LKML-Reference: <1291129145.32004.874.camel@laptop>
Signed-off-by: Ingo Molnar <mi...@elte.hu>
---
include/linux/sched.h | 2 +-
kernel/sched.c | 150 +++++++++++++++++++++++++++++++++++++++++++++----
kernel/timer.c | 2 +-
3 files changed, 141 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2c79e92..2238745 100644

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -143,7 +143,7 @@ extern unsigned long nr_iowait_cpu(int cpu);
extern unsigned long this_cpu_load(void);

-extern void calc_global_load(void);
+extern void calc_global_load(unsigned long ticks);

extern unsigned long get_parent_ip(unsigned long addr);

diff --git a/kernel/sched.c b/kernel/sched.c

index dc91a4d..6b7c26a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3119,6 +3119,15 @@ static long calc_load_fold_active(struct rq *this_rq)

return delta;
}

+static unsigned long
+calc_load(unsigned long load, unsigned long exp, unsigned long active)
+{
+ load *= exp;
+ load += active * (FIXED_1 - exp);
+ load += 1UL << (FSHIFT - 1);
+ return load >> FSHIFT;
+}
+
#ifdef CONFIG_NO_HZ
/*
* For NO_HZ we delay the active fold to the next LOAD_FREQ update.

@@ -3148,6 +3157,128 @@ static long calc_load_fold_idle(void)

+/*
+ * NO_HZ can leave us missing all per-cpu ticks calling
+ * calc_load_account_active(), but since an idle CPU folds its delta into
+ * calc_load_tasks_idle per calc_load_account_idle(), all we need to do is fold
+ * in the pending idle delta if our idle period crossed a load cycle boundary.
+ *
+ * Once we've updated the global active value, we need to apply the exponential
+ * weights adjusted to the number of cycles missed.
+ */

+static void calc_global_nohz(unsigned long ticks)
+{
+ long delta, active, n;
+
+ if (time_before(jiffies, calc_load_update))
+ return;
+
+ /*
+ * If we crossed a calc_load_update boundary, make sure to fold
+ * any pending idle changes, the respective CPUs might have
+ * missed the tick driven calc_load_account_active() update

+ * due to NO_HZ.

+ */
+ delta = calc_load_fold_idle();
+ if (delta)
+ atomic_long_add(delta, &calc_load_tasks);
+

+ /*
+ * If we were idle for multiple load cycles, apply them.
+ */

+ if (ticks >= LOAD_FREQ) {
+ n = ticks / LOAD_FREQ;
+
+ active = atomic_long_read(&calc_load_tasks);
+ active = active > 0 ? active * FIXED_1 : 0;
+
+ avenrun[0] = calc_load_n(avenrun[0], EXP_1, active, n);
+ avenrun[1] = calc_load_n(avenrun[1], EXP_5, active, n);
+ avenrun[2] = calc_load_n(avenrun[2], EXP_15, active, n);
+
+ calc_load_update += n * LOAD_FREQ;
+ }
+

+ /*
+ * Its possible the remainder of the above division also crosses
+ * a LOAD_FREQ period, the regular check in calc_global_load()
+ * which comes after this will take care of that.
+ *
+ * Consider us being 11 ticks before a cycle completion, and us
+ * sleeping for 4*LOAD_FREQ + 22 ticks, then the above code will
+ * age us 4 cycles, and the test in calc_global_load() will
+ * pick up the final one.
+ */

+}
#else
static void calc_load_account_idle(struct rq *this_rq)
{

@@ -3157,6 +3288,10 @@ static inline long calc_load_fold_idle(void)

{
return 0;
}
+
+static void calc_global_nohz(unsigned long ticks)
+{
+}
#endif

/**

@@ -3174,24 +3309,17 @@ void get_avenrun(unsigned long *loads, unsigned long offset, int shift)

loads[2] = (avenrun[2] + offset) << shift;
}

-static unsigned long
-calc_load(unsigned long load, unsigned long exp, unsigned long active)
-{
- load *= exp;
- load += active * (FIXED_1 - exp);
- return load >> FSHIFT;
-}
-
/*
* calc_load - update the avenrun load estimates 10 ticks after the
* CPUs have updated calc_load_tasks.
*/
-void calc_global_load(void)
+void calc_global_load(unsigned long ticks)
{
- unsigned long upd = calc_load_update + 10;
long active;

- if (time_before(jiffies, upd))
+ calc_global_nohz(ticks);
+
+ if (time_before(jiffies, calc_load_update + 10))
return;

active = atomic_long_read(&calc_load_tasks);
diff --git a/kernel/timer.c b/kernel/timer.c

index 68a9ae7..7bd715f 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1319,7 +1319,7 @@ void do_timer(unsigned long ticks)