With 2.6.36-rc6, I'm seeing a load of around 0.60 when the machine is
completely idle. This is similar to what someone reported for the
latest 2.6.35.x stables. This is on a Core i7 machine; I have no time
to bisect or test earlier versions right now, but I guess this is easy
to reproduce on the same platform.
Best,
--
Damien Wyart
> With 2.6.36-rc6, I'm seeing a load of around 0.60 when the machine is
> completely idle. This is similar to what someone reported for the
> latest 2.6.35.x stables. This is on a Core i7 machine; I have no time
> to bisect or test earlier versions right now, but I guess this is easy
> to reproduce on the same platform.
After further investigation and cross-checking with the thread "PROBLEM:
Unusually high load average when idle in 2.6.35, 2.6.35.1 and later",
I came to the following results:
- the commit 74f5187ac873042f502227701ed1727e7c5fbfa9 isolated by Tim
seems to be the culprit;
- reverting it solves the problem with 2.6.36-rc7 in NOHZ mode: the load
when idle goes down to 0.00 (which it never does with the patch
applied)
- using nohz=no with the commit reverted still gives correct behaviour
(tested just in case)
- vanilla 2.6.36-rc7 with this commit applied has the problem (load is
around 0.60 when the machine is idle, sometimes less after several
hours of uptime, but never 0.00), and rebooting with nohz=no makes the
problem disappear: load goes down to 0.00 quickly after the boot
process has finished.
I hope this answers the questions raised in Tim's thread.
Could someone with knowledge of the commit take a look at the problem?
It would be a bit annoying to have this problem in 2.6.36, since Tim's
initial report dates back two weeks...
I can help with further testing if needed.
Thanks in advance,
--
Damien
Sorry I haven't been as responsive to this issue as I would have liked.
I've been rather busy with other work.
My biggest testing concern is that in reality, load on a normal desktop
machine (i.e. not some stripped-down machine disconnected from the
network or any other input, running nothing but busybox) should not be
0.00. Maybe load on a server doing absolutely nothing could be 0.00,
but there's usually something going on that should bump it up to a few
hundredths. Watch top, and if you see at least 1% constant CPU usage
there, your load average should be at least 0.01. That said, there does
seem to be a bug somewhere, as a load of 0.60 on an idle machine seems
high.
-- Chase
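To make the "at least 0.01" intuition above concrete, here is a toy
userspace simulation, not kernel code, of how the load average samples
the runnable + uninterruptible count every LOAD_FREQ (~5 s). It assumes
the task's runnable moments line up with roughly 1% of those samples;
FSHIFT, FIXED_1 and EXP_1 are the kernel's fixed-point constants and
calc_load() mirrors the kernel helper of the same name.

#include <stdio.h>
#include <stdlib.h>

#define FSHIFT  11
#define FIXED_1 (1UL << FSHIFT)         /* 1.0 in fixed point */
#define EXP_1   1884                    /* 1/exp(5 sec/1 min), fixed point */

/* same shape as the kernel's calc_load() */
static unsigned long calc_load(unsigned long load, unsigned long exp,
                               unsigned long active)
{
        load *= exp;
        load += active * (FIXED_1 - exp);
        return load >> FSHIFT;
}

int main(void)
{
        unsigned long avenrun = 0;
        double mean = 0.0;
        int i, samples = 1000000;

        srand(1);
        for (i = 0; i < samples; i++) {
                /* runnable at ~1% of the 5-second samples */
                unsigned long active = (rand() % 100 == 0) ? FIXED_1 : 0;

                avenrun = calc_load(avenrun, EXP_1, active);
                mean += (double)avenrun / FIXED_1;
        }
        /* long-run mean settles around a hundredth, not 0.00 */
        printf("long-run 1-minute load: %.2f\n", mean / samples);
        return 0;
}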
FWIW, I'm using htop, which may behave differently from top, but after
a few seconds or tens of seconds, the values really stabilize at 0.00
(displaying 0.01 from time to time, but quite rarely). I also get 0.00
from calling "uptime".
These tests have been done on a big desktop machine accessed remotely
through ssh, so the usual graphical environment eye candy and Web
browser are not running at all, and I have been careful not to run
heavy processes in parallel, so reaching 0.00 doesn't seem abnormal in
that case.
As you wrote, the "bad" case with the commit applied and nohz=yes
really seems wrong, because under the same conditions the idle load is
dozens of times larger.
--
Damien
> - the commit 74f5187ac873042f502227701ed1727e7c5fbfa9 isolated by Tim
> seems to be the culprit;
Right, so I think I figured out what's happening.
We're folding successive idles of the same cpu into the total idle
number, which is inflating things.
+/*
+ * For NO_HZ we delay the active fold to the next LOAD_FREQ update.
+ *
+ * When making the ILB scale, we should try to pull this in as well.
+ */
+static atomic_long_t calc_load_tasks_idle;
+
+static void calc_load_account_idle(struct rq *this_rq)
+{
+ long delta;
+
+ delta = calc_load_fold_active(this_rq);
+ if (delta)
+ atomic_long_add(delta, &calc_load_tasks_idle);
+}
+
+static long calc_load_fold_idle(void)
+{
+ long delta = 0;
+
+ /*
+ * Its got a race, we don't care...
+ */
+ if (atomic_long_read(&calc_load_tasks_idle))
+ delta = atomic_long_xchg(&calc_load_tasks_idle, 0);
+
+ return delta;
+}
If you look at that and imagine CPU1 going idle with 1 task blocked,
then waking up due to unblocking, then going idle with that same task
blocked, etc., all before we fold_idle on an active cpu, then we can
count that one task many times over.
I haven't come up with a sane patch yet (hackery below), but it does
let my 24-cpu system idle into a load of 0.0x instead of the constant
1.x it had before.
Beware, utter hackery below.. lots of races; not sure it matters, but
it ain't pretty..
Anybody got a bright idea here?
---
kernel/sched.c | 32 ++++++++++++++++++++++++++++++--
kernel/sched_idletask.c | 1 +
2 files changed, 31 insertions(+), 2 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 91c19db..ac4512d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -521,6 +521,9 @@ struct rq {
/* calc_load related fields */
unsigned long calc_load_update;
long calc_load_active;
+#ifdef CONFIG_NO_HZ
+ long calc_load_inactive;
+#endif
#ifdef CONFIG_SCHED_HRTICK
#ifdef CONFIG_SMP
@@ -1817,6 +1820,7 @@ static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)
#endif
static void calc_load_account_idle(struct rq *this_rq);
+static void calc_load_account_nonidle(struct rq *this_rq);
static void update_sysctl(void);
static int get_update_sysctl_factor(void);
static void update_cpu_load(struct rq *this_rq);
@@ -2978,14 +2982,33 @@ static long calc_load_fold_active(struct rq *this_rq)
* When making the ILB scale, we should try to pull this in as well.
*/
static atomic_long_t calc_load_tasks_idle;
+static cpumask_var_t calc_load_mask;
static void calc_load_account_idle(struct rq *this_rq)
{
long delta;
delta = calc_load_fold_active(this_rq);
- if (delta)
+ this_rq->calc_load_inactive = delta;
+
+ if (delta) {
atomic_long_add(delta, &calc_load_tasks_idle);
+ cpumask_set_cpu(cpu_of(this_rq), calc_load_mask);
+ }
+
+ trace_printk("idle start: %d %Ld %Ld\n", cpu_of(this_rq), delta,
+ atomic_long_read(&calc_load_tasks_idle));
+}
+
+static void calc_load_account_nonidle(struct rq *this_rq)
+{
+ if (cpumask_test_and_clear_cpu(cpu_of(this_rq), calc_load_mask)) {
+ atomic_long_sub(this_rq->calc_load_inactive, &calc_load_tasks_idle);
+ trace_printk("idle end: %d %Ld %Ld\n", cpu_of(this_rq),
+ this_rq->calc_load_inactive,
+ atomic_long_read(&calc_load_tasks_idle));
+ } else
+ trace_printk("idle end: %d\n", cpu_of(this_rq));
}
static long calc_load_fold_idle(void)
@@ -2995,8 +3018,12 @@ static long calc_load_fold_idle(void)
/*
* Its got a race, we don't care...
*/
- if (atomic_long_read(&calc_load_tasks_idle))
+ if (atomic_long_read(&calc_load_tasks_idle)) {
delta = atomic_long_xchg(&calc_load_tasks_idle, 0);
+ cpumask_clear(calc_load_mask);
+ }
+
+ trace_printk("idle fold: %d %Ld\n", smp_processor_id(), delta);
return delta;
}
@@ -7935,6 +7962,7 @@ void __init sched_init(void)
atomic_set(&nohz.load_balancer, nr_cpu_ids);
atomic_set(&nohz.first_pick_cpu, nr_cpu_ids);
atomic_set(&nohz.second_pick_cpu, nr_cpu_ids);
+ zalloc_cpumask_var(&calc_load_mask, GFP_NOWAIT);
#endif
/* May be allocated at isolcpus cmdline parse time */
if (cpu_isolated_map == NULL)
diff --git a/kernel/sched_idletask.c b/kernel/sched_idletask.c
index 9fa0f402..a7fa1aa 100644
--- a/kernel/sched_idletask.c
+++ b/kernel/sched_idletask.c
@@ -42,6 +42,7 @@ dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags)
static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
{
+ calc_load_account_nonidle(rq);
}
static void task_tick_idle(struct rq *rq, struct task_struct *curr, int queued)
Thomas, Chase, any clue?
---
kernel/sched.c | 31 +++++++++++++++++++++++++------
kernel/sched_idletask.c | 1 +
2 files changed, 26 insertions(+), 6 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 3312c64..a56446b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -521,6 +521,10 @@ struct rq {
/* calc_load related fields */
unsigned long calc_load_update;
long calc_load_active;
+#ifdef CONFIG_NO_HZ
+ long calc_load_idle;
+ int calc_load_seq;
+#endif
#ifdef CONFIG_SCHED_HRTICK
#ifdef CONFIG_SMP
@@ -1817,6 +1821,7 @@ static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)
#endif
static void calc_load_account_idle(struct rq *this_rq);
+static void calc_load_account_nonidle(struct rq *this_rq);
static void update_sysctl(void);
static int get_update_sysctl_factor(void);
static void update_cpu_load(struct rq *this_rq);
@@ -2978,14 +2983,25 @@ static long calc_load_fold_active(struct rq *this_rq)
* When making the ILB scale, we should try to pull this in as well.
*/
static atomic_long_t calc_load_tasks_idle;
+static atomic_t calc_load_seq;
static void calc_load_account_idle(struct rq *this_rq)
{
- long delta;
+ long idle;
- delta = calc_load_fold_active(this_rq);
- if (delta)
- atomic_long_add(delta, &calc_load_tasks_idle);
+ idle = calc_load_fold_active(this_rq);
+ this_rq->calc_load_idle = idle;
+
+ if (idle) {
+ this_rq->calc_load_seq = atomic_read(&calc_load_seq);
+ atomic_long_add(idle, &calc_load_tasks_idle);
+ }
+}
+
+static void calc_load_account_nonidle(struct rq *this_rq)
+{
+ if (atomic_read(&calc_load_seq) == this_rq->calc_load_seq)
+ atomic_long_sub(this_rq->calc_load_idle, &calc_load_tasks_idle);
}
static long calc_load_fold_idle(void)
@@ -2993,10 +3009,13 @@ static long calc_load_fold_idle(void)
long delta = 0;
/*
- * Its got a race, we don't care...
+ * Its got races, we don't care... its only statistics after all.
*/
- if (atomic_long_read(&calc_load_tasks_idle))
+ if (atomic_long_read(&calc_load_tasks_idle)) {
delta = atomic_long_xchg(&calc_load_tasks_idle, 0);
+ if (delta)
+ atomic_inc(&calc_load_seq);
+ }
return delta;
> After further investigation and cross-checking with the thread "PROBLEM:
> Unusually high load average when idle in 2.6.35, 2.6.35.1 and later",
> I came to the following results:
> - the commit 74f5187ac873042f502227701ed1727e7c5fbfa9 isolated by Tim
> seems to be the culprit;
> - reverting it solves the problem with 2.6.36-rc7 in NOHZ mode: the load
> when idle goes down to 0.00 (which it never does with the patch
> applied)
In fact, after several hours of uptime, I also ran into a situation
where the load was around 0.80 or 0.60 when idle, with the commit
reverted. So IMHO just reverting is not a better option than keeping
the offending commit; a real rework of the code is needed to clean up
the situation.
Shouldn't we enlarge the CC list? For now, responsiveness has been
close to zero, and it seems we will get a 2.6.36 with a buggy load
average calculation. Even if it is only statistics, many monitoring
tools rely on the load average, so for production environments this is
not a good thing.
Best,
It already contains all the folks who know the code, I'm afraid.. :/
I've been playing with it a bit more today, but haven't actually
managed to make it better, just differently worse..
Ah, I just remembered Venki recently poked at this code too; maybe
he's got a bright idea..
Venki, there are cpu-load issues; the reported issue is that idle load
is too high, and I think I can see that happening with the current
code (due to 74f5187ac8).
The flaw I can see in that commit is that we can go idle multiple times
during the LOAD_FREQ window, which will basically inflate the idle
contribution.
All my attempts to fix that so far have resulted in curious
results..
Would you have a moment to also look at this?
Just for reference, this is my latest patch.. I figured that since it's
NOHZ-related it should actually be keyed off the nohz code, not off
going idle.
---
include/linux/sched.h | 8 ++++++++
kernel/sched.c | 28 +++++++++++++++++++++-------
kernel/sched_idletask.c | 1 -
kernel/time/tick-sched.c | 2 ++
4 files changed, 31 insertions(+), 8 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0383601..5311ef4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -145,6 +145,14 @@ extern unsigned long this_cpu_load(void);
extern void calc_global_load(void);
+#ifdef CONFIG_NO_HZ
+extern void calc_load_account_idle(void);
+extern void calc_load_account_nonidle(void);
+#else
+static inline void calc_load_account_idle(void) { }
+static inline void calc_load_account_nonidle(void) { }
+#endif
+
extern unsigned long get_parent_ip(unsigned long addr);
struct seq_file;
diff --git a/kernel/sched.c b/kernel/sched.c
index abf8440..79a29e6 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -526,6 +526,10 @@ struct rq {
/* calc_load related fields */
unsigned long calc_load_update;
long calc_load_active;
+#ifdef CONFIG_NO_HZ
+ long calc_load_inactive;
+ int calc_load_seq;
+#endif
#ifdef CONFIG_SCHED_HRTICK
#ifdef CONFIG_SMP
@@ -1833,7 +1837,6 @@ static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)
}
#endif
-static void calc_load_account_idle(struct rq *this_rq);
static void update_sysctl(void);
static int get_update_sysctl_factor(void);
static void update_cpu_load(struct rq *this_rq);
@@ -3111,16 +3114,29 @@ static long calc_load_fold_active(struct rq *this_rq)
* When making the ILB scale, we should try to pull this in as well.
*/
static atomic_long_t calc_load_tasks_idle;
+static atomic_t calc_load_seq;
-static void calc_load_account_idle(struct rq *this_rq)
+void calc_load_account_idle(void)
{
+ struct rq *this_rq = this_rq();
long delta;
delta = calc_load_fold_active(this_rq);
+ this_rq->calc_load_inactive = delta;
+ this_rq->calc_load_seq = atomic_read(&calc_load_seq);
+
if (delta)
atomic_long_add(delta, &calc_load_tasks_idle);
}
+void calc_load_account_nonidle(void)
+{
+ struct rq *this_rq = this_rq();
+
+ if (atomic_read(&calc_load_seq) == this_rq->calc_load_seq)
+ atomic_long_add(this_rq->calc_load_inactive, &calc_load_tasks_idle);
+}
+
static long calc_load_fold_idle(void)
{
long delta = 0;
@@ -3128,16 +3144,14 @@ static long calc_load_fold_idle(void)
/*
* Its got a race, we don't care...
*/
- if (atomic_long_read(&calc_load_tasks_idle))
+ if (atomic_long_read(&calc_load_tasks_idle)) {
+ atomic_inc(&calc_load_seq);
delta = atomic_long_xchg(&calc_load_tasks_idle, 0);
+ }
return delta;
}
#else
-static void calc_load_account_idle(struct rq *this_rq)
-{
-}
-
static inline long calc_load_fold_idle(void)
{
return 0;
diff --git a/kernel/sched_idletask.c b/kernel/sched_idletask.c
index 9fa0f402..6ca191f 100644
--- a/kernel/sched_idletask.c
+++ b/kernel/sched_idletask.c
@@ -23,7 +23,6 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
static struct task_struct *pick_next_task_idle(struct rq *rq)
{
schedstat_inc(rq, sched_goidle);
- calc_load_account_idle(rq);
return rq->idle;
}
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 3e216e0..808abd7 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -411,6 +411,7 @@ void tick_nohz_stop_sched_tick(int inidle)
ts->tick_stopped = 1;
ts->idle_jiffies = last_jiffies;
rcu_enter_nohz();
+ calc_load_account_idle();
}
ts->idle_sleeps++;
@@ -520,6 +521,7 @@ void tick_nohz_restart_sched_tick(void)
ts->inidle = 0;
+ calc_load_account_nonidle();
rcu_exit_nohz();
/* Update jiffies first */
So that atomic_long_add() in calc_load_account_nonidle() should read
atomic_long_sub().
Trouble is, with that fixed, load goes down; it just never goes up :/
OK, how does this work for people? I find my idle load is still a tad
high, but maybe I'm not patient enough.
---
include/linux/sched.h | 8 ++++++++
kernel/sched.c | 32 ++++++++++++++++++++++++++------
kernel/sched_idletask.c | 1 -
kernel/time/tick-sched.c | 2 ++
4 files changed, 36 insertions(+), 7 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0383601..5311ef4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -145,6 +145,14 @@ extern unsigned long this_cpu_load(void);
extern void calc_global_load(void);
+#ifdef CONFIG_NO_HZ
+extern void calc_load_account_idle(void);
+extern void calc_load_account_nonidle(void);
+#else
+static inline void calc_load_account_idle(void) { }
+static inline void calc_load_account_nonidle(void) { }
+#endif
+
extern unsigned long get_parent_ip(unsigned long addr);
struct seq_file;
diff --git a/kernel/sched.c b/kernel/sched.c
index abf8440..f214222 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -526,6 +526,10 @@ struct rq {
/* calc_load related fields */
unsigned long calc_load_update;
long calc_load_active;
+#ifdef CONFIG_NO_HZ
+ long calc_load_inactive;
+ int calc_load_seq;
+#endif
#ifdef CONFIG_SCHED_HRTICK
#ifdef CONFIG_SMP
@@ -1833,7 +1837,6 @@ static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)
}
#endif
-static void calc_load_account_idle(struct rq *this_rq);
static void update_sysctl(void);
static int get_update_sysctl_factor(void);
static void update_cpu_load(struct rq *this_rq);
@@ -3111,16 +3114,36 @@ static long calc_load_fold_active(struct rq *this_rq)
* When making the ILB scale, we should try to pull this in as well.
*/
static atomic_long_t calc_load_tasks_idle;
+static atomic_t calc_load_seq;
-static void calc_load_account_idle(struct rq *this_rq)
+void calc_load_account_idle(void)
{
+ struct rq *this_rq = this_rq();
long delta;
delta = calc_load_fold_active(this_rq);
+ this_rq->calc_load_inactive = delta;
+ this_rq->calc_load_seq = atomic_read(&calc_load_seq);
+
if (delta)
atomic_long_add(delta, &calc_load_tasks_idle);
}
+void calc_load_account_nonidle(void)
+{
+ struct rq *this_rq = this_rq();
+
+ if (atomic_read(&calc_load_seq) == this_rq->calc_load_seq) {
+ atomic_long_sub(this_rq->calc_load_inactive, &calc_load_tasks_idle);
+ /*
+ * Undo the _fold_active() from _account_idle(). This
+ * avoids us loosing active tasks and creating a negative
+ * bias
+ */
+ this_rq->calc_load_active -= this_rq->calc_load_inactive;
+ }
+}
+
static long calc_load_fold_idle(void)
{
long delta = 0;
@@ -3128,16 +3151,13 @@ static long calc_load_fold_idle(void)
/*
* Its got a race, we don't care...
*/
+ atomic_inc(&calc_load_seq);
if (atomic_long_read(&calc_load_tasks_idle))
delta = atomic_long_xchg(&calc_load_tasks_idle, 0);
Looks quite fine after some basic tests, much saner than without the
patch. A bit slow to go down, but I reach 0.00 after enough time spent
idle.
Can't tell about the behavior after hours of uptime, of course, but the
values during the first minutes after bootup seem OK; without the
patch, they are evidently wrong...
Maybe commit this after some more review in the next one or two days,
and think about further tweaking during 2.6.37?
--
Damien
I haven't had a chance to keep up with the topic, and I apologize. I'll be
testing this as soon as I can finish compiling it. Thank you all for not
letting this go unfixed.
Tim McGrath
Try -tip (which includes the scheduler development tree as well):
http://people.redhat.com/mingo/tip.git/README
Thanks,
Ingo
Tried that, the patch still doesn't apply... and I just figured out
why. Looks like my email client is screwing the patch up; mutt
apparently wants to chew on my mail before I get it. Viewing the mail
as an attachment and saving it works properly, however.
Now that I have properly saved the mail, it applies cleanly to
tip/master as well as to 74f5187ac873042f502227701ed1727e7c5fbfa9,
though in the latter case it has to fuzz around a bit. I'll try testing
74f5187ac873042f502227701ed1727e7c5fbfa9 first since it's the one I
*know* is flawed, and I want to reduce the amount of changes I have to
test for.
I'll build and test it, then let you guys know if there's any
noticeable difference.
Tim McGrath
Ok, so while trying to write a changelog for this patch I got myself
terribly confused again..
calc_load_fold_active() is a relative operation and simply gives delta
values since the last time it got called. That means that the sum of
multiple invocations over a given time interval should be identical to
a single invocation.
Therefore, the going-idle-multiple-times-during-LOAD_FREQ hypothesis
doesn't really make sense.
Even if a cpu became idle but wasn't idle at the LOAD_FREQ turn-over,
it shouldn't matter, since the calc_load_account_active() call will
simply fold the remaining delta with the accrued idle delta, and the
total should all match up once we fold into the global calc_load_tasks.
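To put that "relative operation" property in code form, here is a toy
userspace sketch, not kernel code, of just the per-rq delta
bookkeeping: fold() returns the change in nr_active since the previous
fold, so any number of folds over a window telescopes to the same
total as a single fold at the end of the window.

#include <stdio.h>

static long calc_load_active;   /* last folded value, as in struct rq */

static long fold(long nr_active)
{
        long delta = nr_active - calc_load_active;

        calc_load_active = nr_active;
        return delta;
}

int main(void)
{
        /* nr_running + nr_uninterruptible seen at successive folds */
        long samples[] = { 1, 0, 2, 1 };
        long sum = 0;
        unsigned int i;

        for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
                sum += fold(samples[i]);

        /* sum of the deltas == final value - initial value, here 1 - 0 */
        printf("accumulated delta: %ld\n", sum);
        return 0;
}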
So afaict it should all have worked and this patch is a big NOP..
except it isn't..
Damn I hate this bug.. ;-) Anybody?
Yes, that's what I was thinking while trying to understand this code
yesterday. Also, with the sequence number I don't think
nr_uninterruptible would be handled correctly, as tasks can move to a
CPU after it first went idle and may not get accounted later.
I somehow feel the problem is with nr_uninterruptible, which gets
accounted multiple times on idle cpus and only once per LOAD_FREQ on
busy cpus. However, things are not fully clear to me yet. I have to
look at the code a bit more.
Thanks,
Venki
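A side note on the nr_uninterruptible bookkeeping Venki refers to: the
increment happens on the cpu a task blocks on (deactivate_task) and
the decrement on the cpu it is later woken to (activate_task at
wakeup), so the per-cpu values drift apart and can even go negative;
only the sum across cpus is meaningful. A toy model, not kernel code,
of just that shape:

#include <stdio.h>

#define NR_CPUS 2

static long nr_uninterruptible[NR_CPUS];

/* deactivate_task(): task goes to sleep on this cpu */
static void block_on(int cpu)
{
        nr_uninterruptible[cpu]++;
}

/* activate_task() at wakeup: may happen on a different cpu */
static void wake_on(int cpu)
{
        nr_uninterruptible[cpu]--;
}

int main(void)
{
        long sum = 0;
        int cpu, i;

        /* one task keeps blocking on cpu0 and being woken up on cpu1 */
        for (i = 0; i < 3; i++) {
                block_on(0);
                wake_on(1);
        }

        for (cpu = 0; cpu < NR_CPUS; cpu++) {
                printf("cpu%d nr_uninterruptible: %ld\n", cpu,
                       nr_uninterruptible[cpu]);
                sum += nr_uninterruptible[cpu];
        }
        printf("sum: %ld\n", sum);      /* per-cpu 3 and -3, 0 in total */
        return 0;
}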
Now that I've actually had a chance to boot the kernel with the patch
applied, I'm sorry to say the load average isn't decaying as fast as it
ought to, at the very least. My machine has been idle for the last ten
minutes, but the one-minute average is still at 0.89 and shooting up to
1.5, the 5-minute average is 0.9, and the 15-minute average is 0.68 and
climbing. Even as I'm writing this the averages are continuing to drop,
but *very* slowly. Glacially, almost. The one-minute average also keeps
spiking high at random, for no reason I can tell.
I'll let you guys know if this actually bottoms out at some point.
Tim McGrath
It did not. When I came home and checked, my load average was a steady
0.7-0.8 across the board on all averages, with the machine having been
idle for six hours. I guess the patch didn't fix the problem for me. If
you want, I'll try building master/tip with the patch applied, but I
doubt it'll really be different.
On the plus side, the patch did do something: it seems much less
erratic than it used to be, for whatever reason, and now shows a very
steady load average rather than jumping about as it does without the
patch applied.
I wish I understood the code enough to know what is going wrong here.
I have to wonder what impact the original bug was causing. It seems to
me that if it only affected a few people, it might be worth backing out
the patch and rethinking the problem it was meant to fix. On the other
hand, if there's some way of diagnosing the problem I'm all for it: are
there some printks or something I could put in the code to find out
when it's doing unlikely or 'impossible' things? There is serious
weirdness going on here and I'd like to figure out the cause of it. I
get the impression we're all bumbling about in the dark, poking at a
gigantic elephant and getting the wrong impressions.
Tim McGrath
So, I tried to simplify things by doing the updates directly from the
idle loop. This is only a test patch; eventually we need to hook it in
somewhere else instead of the idle loop, and it is expected to work
only on x86_64 right now.
Peter: Do you think something like this will work? loadavg went quiet
on two of my test systems after this change (4 cpu and 24 cpu).
Thanks,
Venki
---
arch/x86/kernel/process_64.c | 2 +
kernel/sched.c | 67 +++++++++++------------------------------
kernel/sched_idletask.c | 1 -
3 files changed, 20 insertions(+), 50 deletions(-)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 3d9ea53..aaa8025 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -101,6 +101,7 @@ static inline void play_dead(void)
}
#endif
+void idle_load_update(void);
/*
* The idle thread. There's no useful work to be
* done, so just try to conserve power and have a
@@ -140,6 +141,7 @@ void cpu_idle(void)
stop_critical_timings();
pm_idle();
start_critical_timings();
+ idle_load_update();
trace_power_end(smp_processor_id());
diff --git a/kernel/sched.c b/kernel/sched.c
index dc85ceb..6d589c1 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1819,7 +1819,6 @@ static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)
}
#endif
-static void calc_load_account_idle(struct rq *this_rq);
static void update_sysctl(void);
static int get_update_sysctl_factor(void);
static void update_cpu_load(struct rq *this_rq);
@@ -2959,11 +2958,12 @@ static unsigned long calc_load_update;
unsigned long avenrun[3];
EXPORT_SYMBOL(avenrun);
-static long calc_load_fold_active(struct rq *this_rq)
+static long calc_load_fold(struct rq *this_rq, int idle)
{
- long nr_active, delta = 0;
+ long nr_active = 0, delta = 0;
- nr_active = this_rq->nr_running;
+ if (!idle)
+ nr_active = this_rq->nr_running;
nr_active += (long) this_rq->nr_uninterruptible;
if (nr_active != this_rq->calc_load_active) {
@@ -2974,46 +2974,6 @@ static long calc_load_fold_active(struct rq *this_rq)
return delta;
}
-#ifdef CONFIG_NO_HZ
-/*
- * For NO_HZ we delay the active fold to the next LOAD_FREQ update.
- *
- * When making the ILB scale, we should try to pull this in as well.
- */
-static atomic_long_t calc_load_tasks_idle;
-
-static void calc_load_account_idle(struct rq *this_rq)
-{
- long delta;
-
- delta = calc_load_fold_active(this_rq);
- if (delta)
- atomic_long_add(delta, &calc_load_tasks_idle);
-}
-
-static long calc_load_fold_idle(void)
-{
- long delta = 0;
-
- /*
- * Its got a race, we don't care...
- */
- if (atomic_long_read(&calc_load_tasks_idle))
- delta = atomic_long_xchg(&calc_load_tasks_idle, 0);
-
- return delta;
-}
-#else
-static void calc_load_account_idle(struct rq *this_rq)
-{
-}
-
-static inline long calc_load_fold_idle(void)
-{
- return 0;
-}
-#endif
-
/**
* get_avenrun - get the load average array
* @loads: pointer to dest load array
@@ -3043,7 +3003,7 @@ calc_load(unsigned long load, unsigned long exp, unsigned long active)
*/
void calc_global_load(void)
{
- unsigned long upd = calc_load_update + 10;
+ unsigned long upd = calc_load_update + LOAD_FREQ/2;
long active;
if (time_before(jiffies, upd))
@@ -3063,21 +3023,30 @@ void calc_global_load(void)
* Called from update_cpu_load() to periodically update this CPU's
* active count.
*/
-static void calc_load_account_active(struct rq *this_rq)
+static void calc_load_account(struct rq *this_rq, int idle)
{
long delta;
if (time_before(jiffies, this_rq->calc_load_update))
return;
- delta = calc_load_fold_active(this_rq);
- delta += calc_load_fold_idle();
+ delta = calc_load_fold(this_rq, idle);
if (delta)
atomic_long_add(delta, &calc_load_tasks);
this_rq->calc_load_update += LOAD_FREQ;
}
+void idle_load_update(void)
+{
+ struct rq *rq = this_rq();
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&rq->lock, flags);
+ calc_load_account(rq, 1);
+ raw_spin_unlock_irqrestore(&rq->lock, flags);
+}
+
/*
* The exact cpuload at various idx values, calculated at every tick would be
* load = (2^idx - 1) / 2^idx * load + 1 / 2^idx * cur_load
@@ -3194,7 +3163,7 @@ static void update_cpu_load_active(struct rq *this_rq)
{
update_cpu_load(this_rq);
- calc_load_account_active(this_rq);
+ calc_load_account(this_rq, 0);
}
#ifdef CONFIG_SMP
diff --git a/kernel/sched_idletask.c b/kernel/sched_idletask.c
index 9fa0f40..6ca191f 100644
--- a/kernel/sched_idletask.c
+++ b/kernel/sched_idletask.c
@@ -23,7 +23,6 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
static struct task_struct *pick_next_task_idle(struct rq *rq)
{
schedstat_inc(rq, sched_goidle);
- calc_load_account_idle(rq);
return rq->idle;
}
--
1.7.1
Thanks,
Venki
I'd really like to be able to help test this, but with a 32-bit x86
machine I guess this patch won't do anything for me. How hard would it
be to mangle this into working on my machine, or should I just wait?
Tim McGrath
Yes, I thought about that. One problem there is that it relies on
nohz_idle_balance, which will not be called if all the CPUs are idle,
for example.
As this runs only once every 5 seconds, doing nr_running() and
nr_uninterruptible() should probably be OK even on huge systems. But
that was the original code here, except that it was inside xtime_lock.
Thanks,
Venki
The crude patch would be something like the below.. a smarter patch
would try to avoid that loop.
---
include/linux/sched.h | 2 +-
kernel/sched.c | 20 +++++++++-----------
kernel/timer.c | 2 +-
3 files changed, 11 insertions(+), 13 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7a6e81f..84c1bf1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -143,7 +143,7 @@ extern unsigned long nr_iowait_cpu(int cpu);
extern unsigned long this_cpu_load(void);
-extern void calc_global_load(void);
+extern void calc_global_load(int ticks);
extern unsigned long get_parent_ip(unsigned long addr);
diff --git a/kernel/sched.c b/kernel/sched.c
index 41f1869..49a2baf 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3171,22 +3171,20 @@ calc_load(unsigned long load, unsigned long exp, unsigned long active)
* calc_load - update the avenrun load estimates 10 ticks after the
* CPUs have updated calc_load_tasks.
*/
-void calc_global_load(void)
+void calc_global_load(int ticks)
{
- unsigned long upd = calc_load_update + 10;
long active;
- if (time_before(jiffies, upd))
- return;
-
- active = atomic_long_read(&calc_load_tasks);
- active = active > 0 ? active * FIXED_1 : 0;
+ while (!time_before(jiffies, calc_load_update + 10)) {
+ active = atomic_long_read(&calc_load_tasks);
+ active = active > 0 ? active * FIXED_1 : 0;
- avenrun[0] = calc_load(avenrun[0], EXP_1, active);
- avenrun[1] = calc_load(avenrun[1], EXP_5, active);
- avenrun[2] = calc_load(avenrun[2], EXP_15, active);
+ avenrun[0] = calc_load(avenrun[0], EXP_1, active);
+ avenrun[1] = calc_load(avenrun[1], EXP_5, active);
+ avenrun[2] = calc_load(avenrun[2], EXP_15, active);
- calc_load_update += LOAD_FREQ;
+ calc_load_update += LOAD_FREQ;
+ }
}
/*
diff --git a/kernel/timer.c b/kernel/timer.c
index d6ccb90..9f82b2a 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1297,7 +1297,7 @@ void do_timer(unsigned long ticks)
{
jiffies_64 += ticks;
update_wall_time();
- calc_global_load();
+ calc_global_load(ticks);
}
#ifdef __ARCH_WANT_SYS_ALARM
a1 = a0 * e + a * (1 - e)
a2 = a1 * e + a * (1 - e)
= (a0 * e + a * (1 - e)) * e + a * (1 - e)
= a0 * e^2 + a * (1 - e) * (1 + e)
a3 = a2 * e + a * (1 - e)
= (a0 * e^2 + a * (1 - e) * (1 + e)) * e + a * (1 - e)
= a0 * e^3 + a * (1 - e) * (1 + e + e^2)
an = a0 * e^n + a * (1 - e) * (1 + e + ... + e^n-1)
= a0 * e^n + a * (1 - e) * (1 - e^n)/(1 - e)
= a0 * e^n + a * (1 - e^n)
The trouble seems to be that this is a rather slow function; stuffing
it in a table will give us either very large tables or very coarse
decay.
n: an an * 2048
1: 0.99446 2037
2: 0.98895 2025
4: 0.978023 2003
8: 0.956529 1959
16: 0.914947 1874
32: 0.837128 1714
64: 0.700784 1435
128: 0.491098 1006
256: 0.241177 494
512: 0.0581666 119
1024: 0.00338335 7
2048: 1.14471e-05 0
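For reference, those numbers look like e^n with e = exp(-5 sec/15 min),
the per-LOAD_FREQ decay factor of the 15-minute average (an assumption,
but it matches every row); a throwaway program along these lines
reproduces the table, with the right column being e^n scaled by
FIXED_1 = 2048:

#include <math.h>
#include <stdio.h>

int main(void)
{
        /* decay per 5-second LOAD_FREQ interval of the 15-min average */
        const double e = exp(-5.0 / (15.0 * 60.0));
        int n;

        for (n = 1; n <= 2048; n *= 2)
                printf("%4d: %-12g %4.0f\n", n, pow(e, n), pow(e, n) * 2048.0);
        return 0;
}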
Not tested other than compiling, but how does that look?
---
include/linux/sched.h | 2 +-
kernel/sched.c | 74 +++++++++++++++++++++++++++++++++++++++++++++---
kernel/timer.c | 2 +-
3 files changed, 71 insertions(+), 7 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b3d07df..f878db3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -143,7 +143,7 @@ extern unsigned long nr_iowait_cpu(int cpu);
extern unsigned long this_cpu_load(void);
-extern void calc_global_load(void);
+extern void calc_global_load(int ticks);
extern unsigned long get_parent_ip(unsigned long addr);
diff --git a/kernel/sched.c b/kernel/sched.c
index 41f1869..125417e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3168,10 +3168,64 @@ calc_load(unsigned long load, unsigned long exp, unsigned long active)
}
/*
+ * fixed_power_int - compute: x^n in O(log n) time
+ * @x: base of the power
+ * frac_bits: fractional bits of @x
+ * @n: power to raise @x to.
+ */
+static unsigned long
+fixed_power_int(unsigned long x, unsigned int frac_bits, unsigned int n)
+{
+ unsigned long result = 1UL << frac_bits;
+
+ while (n) {
+ if (n & 1) {
+ result *= x;
+ result += 1UL << (frac_bits - 1);
+ result >>= frac_bits;
+ }
+ n >>= 1;
+ x *= x;
+ x >>= frac_bits;
+ }
+
+ return result;
+}
+
+/*
+ * a1 = a0 * e + a * (1 - e)
+ *
+ * a2 = a1 * e + a * (1 - e)
+ * = (a0 * e + a * (1 - e)) * e + a * (1 - e)
+ * = a0 * e^2 + a * (1 - e) * (1 + e)
+ *
+ * a3 = a2 * e + a * (1 - e)
+ * = (a0 * e^2 + a * (1 - e) * (1 + e)) * e + a * (1 - e)
+ * = a0 * e^3 + a * (1 - e) * (1 + e + e^2)
+ *
+ * an = a0 * e^n + a * (1 - e) * (1 + e + ... + e^n-1)
+ * = a0 * e^n + a * (1 - e) * (1 - e^n)/(1 - e)
+ * = a0 * e^n + a * (1 - e^n)
+ */
+static unsigned long
+calc_load_n(unsigned long load, unsigned long exp,
+ unsigned long active, unsigned int n)
+{
+ unsigned long en = fixed_power_int(exp, FSHIFT, n);
+
+ load *= en;
+ load += active * (FIXED_1 - en);
+ load += 1UL << (FSHIFT - 1);
+ load >>= FSHIFT;
+
+ return load;
+}
+
+/*
* calc_load - update the avenrun load estimates 10 ticks after the
* CPUs have updated calc_load_tasks.
*/
-void calc_global_load(void)
+void calc_global_load(int ticks)
{
unsigned long upd = calc_load_update + 10;
long active;
@@ -3182,11 +3236,21 @@ void calc_global_load(void)
active = atomic_long_read(&calc_load_tasks);
active = active > 0 ? active * FIXED_1 : 0;
- avenrun[0] = calc_load(avenrun[0], EXP_1, active);
- avenrun[1] = calc_load(avenrun[1], EXP_5, active);
- avenrun[2] = calc_load(avenrun[2], EXP_15, active);
+ if (ticks > LOAD_FREQ) {
+ int n = (jiffies - upd) / LOAD_FREQ;
+
+ avenrun[0] = calc_load_n(avenrun[0], EXP_1, active, n);
+ avenrun[1] = calc_load_n(avenrun[1], EXP_5, active, n);
+ avenrun[2] = calc_load_n(avenrun[2], EXP_15, active, n);
- calc_load_update += LOAD_FREQ;
+ calc_load_update += n * LOAD_FREQ;
+ } else {
+ avenrun[0] = calc_load(avenrun[0], EXP_1, active);
+ avenrun[1] = calc_load(avenrun[1], EXP_5, active);
+ avenrun[2] = calc_load(avenrun[2], EXP_15, active);
+
+ calc_load_update += LOAD_FREQ;
+ }
}
/*
diff --git a/kernel/timer.c b/kernel/timer.c
index d6ccb90..9f82b2a 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1297,7 +1297,7 @@ void do_timer(unsigned long ticks)
{
jiffies_64 += ticks;
update_wall_time();
- calc_global_load();
+ calc_global_load(ticks);
}
#ifdef __ARCH_WANT_SYS_ALARM
--
Hi Peter,
I finally got around to building a test kernel for a user who had been
reporting this against F-14 (2.6.35), but they report there's not much
improvement.
Sorry. :/
<quote>
Date: Tue, 09 Nov 2010 12:46:53 -0600
From: Michael Cronenworth <mi...@cchtml.com>
To: te...@lists.fedoraproject.org
Subject: Re: Virtually impossible to keep load under 0.5 on idle desktop with
F14+kms
On 11/09/2010 11:55 AM, Kyle McMartin wrote:
> http://kyle.fedorapeople.org/kernel/2.6.35.6-52.sched1.fc14/
Running this kernel is not any better. After leaving my system idle for
10 minutes I finally saw it dip below 0.10, but after leaving the
computer, and coming back 10 minutes later, I see a load of 0.25. No
programs running.
Before I hit "send" on this message, I ran uptime again and got this:
$ uptime
12:45:24 up 32 min, 3 users, load average: 0.82, 0.43, 0.34
Thunderbird was the only program running. 99.8% idle.
</quote>
regards, Kyle
Right, I tested it myself yesterday and got similar results. Does the
'simple' patch from a few posts back work for people?
http://lkml.org/lkml/2010/10/26/150
If that does work, I fudged the fancy math; if that doesn't work
either, we're back to square zero.