With 2.6.36-rc6, I'm seeing a load of around 0.60 when the machine is completely idle. This is similar to what someone reported for the latest 2.6.35.x stable releases. This is on a Core i7 machine. I have no time to bisect or test earlier versions right now, but I guess this is easy to reproduce on the same platform.
Discussion subject changed to "High CPU load when machine is idle (related to PROBLEM: Unusually high load average when idle in 2.6.35, 2.6.35.1 and later)" by Damien Wyart
> With 2.6.36-rc6, I'm seeing a load of around 0.60 when the machine is > completely idle. This is similar to what someone reported for the latest > 2.6.35.x stables. This is on a core i7 machine, but I've no time to > bisect or test earlier versions right now, but I guess this is easy to > reproduce on the same plateform.
After further investigation and cross-checking with the thread "PROBLEM: Unusually high load average when idle in 2.6.35, 2.6.35.1 and later", I came to the following results:
- the commit 74f5187ac873042f502227701ed1727e7c5fbfa9 isolated by Tim seems to be the culprit;
- reverting it solves the problem with 2.6.36-rc7 in NOHZ mode: the load when idle goes down to 0.00 (which it never does with the patch applied)
- using nohz=no with the commit reverted still gives correct behaviour (tested just in case)
- vanilla 2.6.36-rc7 with this commit applied has the problem (load is around 0.60 when machine idle, sometimes less after several hours of uptime, but never 0.00), and rebooting with nohz=no makes the problem disappear: load goes down to 0.00 quickly after boot process has finished.
I hope this answers the questions raised in Tim's thread.
Could someone with knowledge of the commit take a look at the problem? It would be a bit annoying to have this problem in 2.6.36, since Tim's initial report dates back to 2 weeks ago...
On Thu, 2010-10-14 at 16:58 +0200, Damien Wyart wrote: > Hello,
> > With 2.6.36-rc6, I'm seeing a load of around 0.60 when the machine is > > completely idle. This is similar to what someone reported for the latest > > 2.6.35.x stables. This is on a core i7 machine, but I've no time to > > bisect or test earlier versions right now, but I guess this is easy to > > reproduce on the same plateform.
> After further investigation and cross-checking with the thread "PROBLEM: > Unusually high load average when idle in 2.6.35, 2.6.35.1 and later", > I came to the following results:
> - the commit 74f5187ac873042f502227701ed1727e7c5fbfa9 isolated by Tim > seems to be the culprit; > - reverting it solves the problem with 2.6.36-rc7 in NOHZ mode: the load > when idle goes down to 0.00 (which it never does with the patch > applied) > - using nohz=no with the commit reverted still gives correct behaviour > (tested just in case)
> - vanilla 2.6.36-rc7 with this commit applied has the problem (load is > around 0.60 when machine idle, sometimes less after several hours > of uptime, but never 0.00), and rebooting with nohz=no makes the > problem disappear: load goes down to 0.00 quickly after boot process > has finished.
> I hope this answers the questions raised in the Tim's thread.
> Could someone with knowledge of the commit take a look at the problem? > It would be a bit annoying to have this problem in 2.6.36, since Tim's > initial report dates back to 2 weeks ago...
> I can help for further testing if needed.
Sorry I haven't been as responsive to this issue as I would have liked. I've been rather busy on other work.
My biggest testing concern is that in reality, load on a normal desktop machine (i.e. not some stripped down machine disconnected from network or any other input running nothing but busybox) should not be 0.00. Maybe load on a server doing absolutely nothing could be 0.00, but there's usually something going on that should bump it up to a few hundredths. Watch top, and if you see at least 1% constant cpu usage there, your load average should be at least 0.01. That said, there does seem to be a bug somewhere as a load of 0.60 on an idle machine seems high.
* Chase Douglas <chase.doug...@canonical.com> [101014 17:29]:
> My biggest testing concern is that in reality, load on a normal > desktop machine (i.e. not some stripped down machine disconnected from > network or any other input running nothing but busybox) should not be > 0.00. Maybe load on a server doing absolutely nothing could be 0.00, > but there's usually something going on that should bump it up to a few > hundredths. Watch top, and if you see at least 1% constant cpu usage > there, your load average should be at least 0.01. That said, there > does seem to be a bug somewhere as a load of 0.60 on an idle machine > seems high.
FWIW, I'm using htop which maybe behaves differently than top, but after a few seconds or tens of seconds, values are really stabilizing at 0.00 (displaying 0.01 from time to time, but quite rarely). I also get 0.00 through calling "uptime".
These tests have been done on a big desktop machine accessed remotely through ssh, so the usual graphical environment eyecandy and Web browser are not running at all, and I have been careful not to run heavy processes in parallel, so reaching 0.00 doesn't seem abnormal in that case.
As you wrote, the "bad" case with the commit applied and nohz=yes really seems wrong, because in the same conditions the idle load is several tens of times larger.
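For anyone who wants to log the idle load over time instead of watching htop or uptime, a minimal sketch follows. It assumes the usual /proc/loadavg layout, where the first three fields are the 1, 5 and 15 minute averages; parse_loadavg() is a hypothetical helper name, not something from this thread.

```c
#include <stdio.h>

/*
 * Parse the three load averages from a /proc/loadavg-style line,
 * e.g. "0.60 0.58 0.59 1/123 4567".
 * parse_loadavg() is a hypothetical helper written for this sketch.
 * Returns 0 on success, -1 if the line does not start with three numbers.
 */
int parse_loadavg(const char *line, double out[3])
{
	if (sscanf(line, "%lf %lf %lf", &out[0], &out[1], &out[2]) != 3)
		return -1;
	return 0;
}
```

Reading one line from /proc/loadavg with fgets() every few seconds and passing it to this function gives the same numbers uptime prints, which makes it easy to record whether the averages ever reach 0.00.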
On Thu, 2010-10-14 at 16:58 +0200, Damien Wyart wrote: > - the commit 74f5187ac873042f502227701ed1727e7c5fbfa9 isolated by Tim > seems to be the culprit;
Right, so I think I figured out what's happening.
We're folding successive idles of the same cpu into the total idle number, which is inflating things.
+/*
+ * For NO_HZ we delay the active fold to the next LOAD_FREQ update.
+ *
+ * When making the ILB scale, we should try to pull this in as well.
+ */
+static atomic_long_t calc_load_tasks_idle;
+
+static void calc_load_account_idle(struct rq *this_rq)
+{
+	long delta;
+
+	delta = calc_load_fold_active(this_rq);
+	if (delta)
+		atomic_long_add(delta, &calc_load_tasks_idle);
+}
+
+static long calc_load_fold_idle(void)
+{
+	long delta = 0;
+
+	/*
+	 * Its got a race, we don't care...
+	 */
+	if (atomic_long_read(&calc_load_tasks_idle))
+		delta = atomic_long_xchg(&calc_load_tasks_idle, 0);
+
+	return delta;
+}
If you look at that and imagine CPU1 going idle with 1 task blocked, then waking up due to unblocking, then going idle with that same task blocked, etc.. all before we fold_idle on an active cpu, then we can count that one task many times over.
I haven't come up with a sane patch yet, hackery below, but that does let my 24-cpu system idle into load 0.0x instead of the constant 1.x it had before.
Beware, utter hackery below.. lots of races not sure it matters but it ain't pretty..
 static void calc_load_account_idle(struct rq *this_rq);
+static void calc_load_account_nonidle(struct rq *this_rq);
 static void update_sysctl(void);
 static int get_update_sysctl_factor(void);
 static void update_cpu_load(struct rq *this_rq);
@@ -2978,14 +2982,33 @@ static long calc_load_fold_active(struct rq *this_rq)
  * When making the ILB scale, we should try to pull this in as well.
  */
 static atomic_long_t calc_load_tasks_idle;
+static cpumask_var_t calc_load_mask;

 static void calc_load_account_idle(struct rq *this_rq)
 {
 	long delta;
On Fri, 2010-10-15 at 13:08 +0200, Peter Zijlstra wrote: > On Thu, 2010-10-14 at 16:58 +0200, Damien Wyart wrote:
> > - the commit 74f5187ac873042f502227701ed1727e7c5fbfa9 isolated by Tim > > seems to be the culprit;
> Right, so I think I figured out what's happening.
> We're folding sucessive idles of the same cpu into the total idle > number, which is inflating things.
> +/* > + * For NO_HZ we delay the active fold to the next LOAD_FREQ update. > + * > + * When making the ILB scale, we should try to pull this in as well. > + */ > +static atomic_long_t calc_load_tasks_idle; > + > +static void calc_load_account_idle(struct rq *this_rq) > +{ > + long delta; > + > + delta = calc_load_fold_active(this_rq); > + if (delta) > + atomic_long_add(delta, &calc_load_tasks_idle); > +} > + > +static long calc_load_fold_idle(void) > +{ > + long delta = 0; > + > + /* > + * Its got a race, we don't care... > + */ > + if (atomic_long_read(&calc_load_tasks_idle)) > + delta = atomic_long_xchg(&calc_load_tasks_idle, 0); > + > + return delta; > +}
> If you look at that and imagine CPU1 going idle with 1 task blocked, > then waking up due to unblocking, then going idle with that same task > block, etc.. all before we fold_idle on an active cpu, then we can count > that one task many times over.
OK, I came up with the below, but it's not quite working: load continues to decrease even though I've got a make -j64 running..
 static void calc_load_account_idle(struct rq *this_rq);
+static void calc_load_account_nonidle(struct rq *this_rq);
 static void update_sysctl(void);
 static int get_update_sysctl_factor(void);
 static void update_cpu_load(struct rq *this_rq);
@@ -2978,14 +2983,25 @@ static long calc_load_fold_active(struct rq *this_rq)
  * When making the ILB scale, we should try to pull this in as well.
  */
 static atomic_long_t calc_load_tasks_idle;
+static atomic_t calc_load_seq;

 static void calc_load_account_idle(struct rq *this_rq)
 {
-	long delta;
+	long idle;

 static long calc_load_fold_idle(void)
@@ -2993,10 +3009,13 @@ static long calc_load_fold_idle(void)
 	long delta = 0;

 	/*
-	 * Its got a race, we don't care...
+	 * Its got races, we don't care... its only statistics after all.
 	 */
-	if (atomic_long_read(&calc_load_tasks_idle))
+	if (atomic_long_read(&calc_load_tasks_idle)) {
 		delta = atomic_long_xchg(&calc_load_tasks_idle, 0);
+		if (delta)
+			atomic_inc(&calc_load_seq);
+	}
> > With 2.6.36-rc6, I'm seeing a load of around 0.60 when the machine > > is completely idle. This is similar to what someone reported for the > > latest 2.6.35.x stables. This is on a core i7 machine, but I've no > > time to bisect or test earlier versions right now, but I guess this > > is easy to reproduce on the same plateform. > After further investigation and cross-checking with the thread "PROBLEM: > Unusually high load average when idle in 2.6.35, 2.6.35.1 and later", > I came to the following results: > - the commit 74f5187ac873042f502227701ed1727e7c5fbfa9 isolated by Tim > seems to be the culprit; > - reverting it solves the problem with 2.6.36-rc7 in NOHZ mode: the load > when idle goes down to 0.00 (which it never does with the patch > applied)
In fact, after several hours of uptime, I also came into a situation of the load being around 0.80 or 0.60 when idle with the commit reverted. So IMHO, just reverting is not a better option than keeping the offending commit, and a real rework of the code is needed to clean up the situation.
Shouldn't we enlarge the CC list? So far responsiveness has been close to zero, and it seems we will get a 2.6.36 with a buggy load average calculation. Even if it is only statistics, many supervision tools rely on the load average, so for production environments this is not a good thing.
On Wed, 2010-10-20 at 15:27 +0200, Damien Wyart wrote:
> Should'nt we enlarge the list of CC, because for now, responsivity has > been close to 0 and it seems we will get a 2.6.36 with buggy load avg > calculation. Even if it is only statistics, many supervision tools rely > on the load avg, so for production environments, this is not a good > thing.
It already contains all the folks who know the code I'm afraid.. :/
I've been playing with it a bit more today, but haven't actually managed to make it better, just differently worse.. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Wed, 2010-10-20 at 15:30 +0200, Peter Zijlstra wrote: > On Wed, 2010-10-20 at 15:27 +0200, Damien Wyart wrote:
> > Should'nt we enlarge the list of CC, because for now, responsivity has > > been close to 0 and it seems we will get a 2.6.36 with buggy load avg > > calculation. Even if it is only statistics, many supervision tools rely > > on the load avg, so for production environments, this is not a good > > thing.
> It already contains all the folks who know the code I'm afraid.. :/
> I've been playing with it a bit more today, but haven't actually managed > to make it better, just differently worse..
Ah, I just remembered Venki recently poked at this code too, maybe he's got a bright idea..
Venki, there are cpu-load issues, the reported issue is that idle load is too high, and I think I can see that happening with the current code (due to 74f5187ac8).
The flaw I can see in that commit is that we can go idle multiple times during the LOAD_FREQ window, which will basically inflate the idle contribution.
All attempts from me to fix that so far have resulted in curious results..
Would you have a moment to also look at this?
On Wed, 2010-10-20 at 15:43 +0200, Peter Zijlstra wrote: > On Wed, 2010-10-20 at 15:30 +0200, Peter Zijlstra wrote: > > On Wed, 2010-10-20 at 15:27 +0200, Damien Wyart wrote:
> > > Should'nt we enlarge the list of CC, because for now, responsivity has > > > been close to 0 and it seems we will get a 2.6.36 with buggy load avg > > > calculation. Even if it is only statistics, many supervision tools rely > > > on the load avg, so for production environments, this is not a good > > > thing.
> > It already contains all the folks who know the code I'm afraid.. :/
> > I've been playing with it a bit more today, but haven't actually managed > > to make it better, just differently worse..
> Ah, I just remembered Venki recently poked at this code too, maybe he's > got a bright idea..
> Venki, there are cpu-load issues, the reported issue is that idle load > is too high, and I think I can see that happening with the current code > (due to 74f5187ac8).
> The flaw I can see in that commit is that we can go idle multiple times > during the LOAD_FREQ window, which will basically inflate the idle > contribution.
> All attempts from me to fix that so far have resulted in curious > results..
> Would you have a moment to also look at this?
Just for reference, this is my latest patch.. I figured that since it's NOHZ related it should actually be keyed off the nohz code, not off going idle.
-static void calc_load_account_idle(struct rq *this_rq);
 static void update_sysctl(void);
 static int get_update_sysctl_factor(void);
 static void update_cpu_load(struct rq *this_rq);
@@ -3111,16 +3114,29 @@ static long calc_load_fold_active(struct rq *this_rq)
  * When making the ILB scale, we should try to pull this in as well.
  */
 static atomic_long_t calc_load_tasks_idle;
+static atomic_t calc_load_seq;

-static void calc_load_account_idle(struct rq *this_rq);
 static void update_sysctl(void);
 static int get_update_sysctl_factor(void);
 static void update_cpu_load(struct rq *this_rq);
@@ -3111,16 +3114,36 @@ static long calc_load_fold_active(struct rq *this_rq)
  * When making the ILB scale, we should try to pull this in as well.
  */
 static atomic_long_t calc_load_tasks_idle;
+static atomic_t calc_load_seq;
* Peter Zijlstra <pet...@infradead.org> [101020 19:26]:
> OK, how does this work for people? I find my idle load is still a tad > high, but maybe I'm not patient enough.
Looks quite fine after some basic tests, much saner than without the patch. A bit slow to go down, but I reach 0.00 after enough time being idle.
Can't tell about the behavior after hours of uptime, of course, but the values during the first minutes after bootup seem OK; without the patch, they are evidently wrong...
Maybe commit this after some more reviews in the next 1 or 2 days, and maybe think about further tweaking during 2.6.37?
On Wed, Oct 20, 2010 at 07:26:45PM +0200, Peter Zijlstra wrote:
> OK, how does this work for people? I find my idle load is still a tad > high, but maybe I'm not patient enough.
I haven't had a chance to keep up with the topic, and I apologize. I'll be testing this as soon as I can finish compiling it. Thank you all for not letting this go unfixed.
On Wed, Oct 20, 2010 at 09:48:43PM -0400, tm@ wrote: > On Wed, Oct 20, 2010 at 07:26:45PM +0200, Peter Zijlstra wrote:
> > OK, how does this work for people? I find my idle load is still a tad > > high, but maybe I'm not patient enough.
> I haven't had a chance to keep up with the topic, and I apologize. I'll be > testing this as soon as I can finish compiling it. Thank you all for not > letting this go unfixed.
> Tim McGrath
Uhh, problem. This patch does not apply to git checkout 74f5187ac873042f502227701ed1727e7c5fbfa9
which is the version of the kernel that first exhibits this flaw.
which version of the kernel does this patch apply cleanly to?
* tmhik...@gmail.com <tmhik...@gmail.com> wrote: > On Wed, Oct 20, 2010 at 09:48:43PM -0400, tm@ wrote: > > On Wed, Oct 20, 2010 at 07:26:45PM +0200, Peter Zijlstra wrote:
> > > OK, how does this work for people? I find my idle load is still a tad > > > high, but maybe I'm not patient enough.
> > I haven't had a chance to keep up with the topic, and I apologize. I'll be > > testing this as soon as I can finish compiling it. Thank you all for not > > letting this go unfixed.
> > Tim McGrath
> Uhh, problem. This patch does not apply to git checkout > 74f5187ac873042f502227701ed1727e7c5fbfa9
> which is the version of the kernel that first exhibits this flaw.
> which version of the kernel does this patch apply cleanly to?
Try -tip (which includes the scheduler development tree as well):
> > On Wed, Oct 20, 2010 at 09:48:43PM -0400, tm@ wrote: > > > On Wed, Oct 20, 2010 at 07:26:45PM +0200, Peter Zijlstra wrote:
> > > > OK, how does this work for people? I find my idle load is still a tad > > > > high, but maybe I'm not patient enough.
> > > I haven't had a chance to keep up with the topic, and I apologize. I'll be > > > testing this as soon as I can finish compiling it. Thank you all for not > > > letting this go unfixed.
> > > Tim McGrath
> > Uhh, problem. This patch does not apply to git checkout > > 74f5187ac873042f502227701ed1727e7c5fbfa9
> > which is the version of the kernel that first exhibits this flaw.
> > which version of the kernel does this patch apply cleanly to?
> Try -tip (which includes the scheduler development tree as well):
Tried that, patch still doesn't apply... and I just figured out why. Looks like my email client is screwing the patch up. mutt apparently wants to chew on my mail before I get it. viewing the mail as an attachment and saving it works properly however.
Now that I have properly saved the mail, it applies cleanly to tip/master as well as 74f5187ac873042f502227701ed1727e7c5fbfa9 - though in the latter's case it's having to fuzz around a bit. I'll try testing 74f5187ac873042f502227701ed1727e7c5fbfa9 first since it's the one I *know* is flawed, and I want to reduce the amount of changes that I have to test for.
I'll build and test it, then let you guys know if there's any noticeable difference.
> +void calc_load_account_nonidle(void)
> +{
> +	struct rq *this_rq = this_rq();
> +
> +	if (atomic_read(&calc_load_seq) == this_rq->calc_load_seq) {
> +		atomic_long_sub(this_rq->calc_load_inactive, &calc_load_tasks_idle);
> +		/*
> +		 * Undo the _fold_active() from _account_idle(). This
> +		 * avoids us loosing active tasks and creating a negative
> +		 * bias
> +		 */
> +		this_rq->calc_load_active -= this_rq->calc_load_inactive;
> +	}
> +}
Ok, so while trying to write a changelog on this patch I got myself terribly confused again..
calc_load_fold_active() is a relative operation and simply gives delta values since the last time it got called. That means that the sum of multiple invocations in a given time interval should be identical to a single invocation.
Therefore, the going idle multiple times during LOAD_FREQ hypothesis doesn't really make sense.
Even if it became idle but wasn't idle at the LOAD_FREQ turn-over it shouldn't matter, since the calc_load_account_active() call will simply fold the remaining delta with the accrued idle delta and the total should all match up once we fold into the global calc_load_tasks.
So afaict it should all have worked and this patch is a big NOP.. except it isn't..
Damn I hate this bug.. ;-) Anybody?
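The "relative operation" argument above can be checked with a small user-space model. This is only a sketch: struct rq_model and sum_of_folds() are names made up here, and fold_active() mirrors the shape of calc_load_fold_active() as it appears in the patches in this thread, without any of the kernel's locking or per-cpu details.

```c
/* Hypothetical stand-in for the few struct rq fields that matter here. */
struct rq_model {
	long nr_running;
	long nr_uninterruptible;
	long calc_load_active;	/* value reported at the previous fold */
};

/* Return the change in active tasks since the last call (a delta). */
long fold_active(struct rq_model *rq)
{
	long nr_active = rq->nr_running + rq->nr_uninterruptible;
	long delta = nr_active - rq->calc_load_active;

	rq->calc_load_active = nr_active;
	return delta;
}

/*
 * Replay a sequence of nr_running values, folding after every change,
 * and return the sum of all the deltas that were reported.
 */
long sum_of_folds(struct rq_model *rq, const long *nr_running, int n)
{
	long total = 0;
	int i;

	for (i = 0; i < n; i++) {
		rq->nr_running = nr_running[i];
		total += fold_active(rq);
	}
	return total;
}
```

In this model the deltas telescope: a CPU that bounces between one runnable task and idle five times reports +1, -1, +1, -1, +1, which sums to the same +1 a single fold at the end would have reported. That matches the reasoning that going idle several times inside one LOAD_FREQ window should not, by itself, inflate the total; it does not explain what actually goes wrong.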
>> +void calc_load_account_nonidle(void) >> +{ >> + struct rq *this_rq = this_rq(); >> + >> + if (atomic_read(&calc_load_seq) == this_rq->calc_load_seq) { >> + atomic_long_sub(this_rq->calc_load_inactive, &calc_load_tasks_idle); >> + /* >> + * Undo the _fold_active() from _account_idle(). This >> + * avoids us loosing active tasks and creating a negative >> + * bias >> + */ >> + this_rq->calc_load_active -= this_rq->calc_load_inactive; >> + } >> +}
> Ok, so while trying to write a changelog on this patch I got myself > terribly confused again..
> calc_load_active_fold() is a relative operation and simply gives delta > values since the last time it got called. That means that the sum of > multiple invocations in a given time interval should be identical to a > single invocation.
> Therefore, the going idle multiple times during LOAD_FREQ hypothesis > doesn't really make sense.
Yes. That's what I was thinking while trying to understand this code yesterday.
Also, with the sequence number I don't think nr_uninterruptible would be handled correctly, as tasks can move to a CPU after it first went idle and may not get accounted later.
I somehow feel the problem is with nr_uninterruptible, which gets accounted multiple times on idle tasks and only once per LOAD_FREQ on busy tasks. However, things are not fully clear to me yet. Have to look at the code a bit more.
Thanks, Venki
> Even if it became idle but wasn't idle at the LOAD_FREQ turn-over it > shouldn't matter, since the calc_load_account_active() call will simply > fold the remaining delta with the accrued idle delta and the total > should all match up once we fold into the global calc_load_tasks.
> So afaict its should all have worked and this patch is a big NOP,. > except it isn't..
On Wed, Oct 20, 2010 at 09:48:43PM -0400, tm@ wrote: > On Wed, Oct 20, 2010 at 07:26:45PM +0200, Peter Zijlstra wrote:
> > OK, how does this work for people? I find my idle load is still a tad > > high, but maybe I'm not patient enough.
> I haven't had a chance to keep up with the topic, and I apologize. I'll be > testing this as soon as I can finish compiling it. Thank you all for not > letting this go unfixed.
> Tim McGrath
Now that I've actually had a chance to boot the kernel with the patch applied I'm sorry to say but the load average isn't decaying as fast as it ought to, at the very least. My machine's been idle for the last ten minutes but the one minute average is still at 0.89 and shooting up to 1.5, the 5 min average is 0.9, and the 15 min average is .68 and climbing. Even as I'm writing this the averages are continuing to drop, but *very* slowly. Glacially, almost. The one minute average is continuing to randomly spike high for no reason I can tell as well.
I'll let you guys know if this actually bottoms out at some point.
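As a rough yardstick for how fast the averages ought to fall on a truly idle machine, here is a user-space sketch of the fixed-point update. It assumes the constants I remember from the kernel headers of this era (FSHIFT 11, EXP_1 1884, EXP_5 2014, EXP_15 2037) and one sample roughly every 5 seconds; decay_idle() is a helper invented for this example.

```c
#define FSHIFT	11			/* bits of fixed-point precision */
#define FIXED_1	(1UL << FSHIFT)		/* 1.0 in fixed point */
#define EXP_1	1884			/* 1/exp(5sec/1min) in fixed point */
#define EXP_5	2014			/* 1/exp(5sec/5min) */
#define EXP_15	2037			/* 1/exp(5sec/15min) */

/* One update step, same arithmetic as calc_load() in the patches above. */
unsigned long calc_load(unsigned long load, unsigned long exp,
			unsigned long active)
{
	load *= exp;
	load += active * (FIXED_1 - exp);
	return load >> FSHIFT;
}

/* Hypothetical helper: apply 'samples' updates with no active tasks. */
unsigned long decay_idle(unsigned long load, unsigned long exp, int samples)
{
	while (samples-- > 0)
		load = calc_load(load, exp, 0);
	return load;
}
```

Under these assumptions, ten idle minutes are about 120 samples. Starting from a load of 1.00 (FIXED_1), decay_idle(FIXED_1, EXP_1, 120) ends at 0, and even the 15 minute average should have dropped to roughly half its starting value. Averages that sit at 0.7 to 0.9 after that long suggest the update is still being fed a non-zero active count, not that the decay itself is slow.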
On Thu, Oct 21, 2010 at 02:36:21PM -0400, tm@ wrote: > On Wed, Oct 20, 2010 at 09:48:43PM -0400, tm@ wrote: > > On Wed, Oct 20, 2010 at 07:26:45PM +0200, Peter Zijlstra wrote:
> > > OK, how does this work for people? I find my idle load is still a tad > > > high, but maybe I'm not patient enough.
> > I haven't had a chance to keep up with the topic, and I apologize. I'll be > > testing this as soon as I can finish compiling it. Thank you all for not > > letting this go unfixed.
> > Tim McGrath
> Now that I've actually had a chance to boot the kernel with the patch > applied I'm sorry to say but the load average isn't decaying as fast as it > ought to, at the very least. My machine's been idle for the last ten minutes > but the one minute average is still at 0.89 and shooting up to 1.5, the 5 > min average is 0.9, and the 15 min average is .68 and climbing. Even as I'm > writing this the averages are continuing to drop, but *very* slowly. > Glacially, almost. The one minute average is continuing to randomly spike > high for no reason I can tell as well.
> I'll let you guys know if this actually bottoms out at some point.
> Tim McGrath
It did not. When I came home and checked, my load average was a steady 0.7-0.8 across the board on all averages with the machine idle since six hours ago. I guess the patch didn't fix the problem for me. If you want, I'll try building master/tip with the patch applied, but I doubt it'll really be different.
On the plus side, the patch did do something - it seems much less erratic than it used to be for whatever reason, and now just has a very steady load average rather than jumping about as it does without the patch applied.
I wish I understood the code enough to know what is going wrong here. I have to wonder what impact the original bug was causing. It seems to me that if it only affected a few people it might be worth backing out the patch and rethinking the problem it was meant to fix. On the other hand, if there's some way of diagnosing the problem I'm all for it - is there some kprintfs or something I could put in the code to find out when it's doing unlikely or 'impossible' things? There is serious weirdness going on here and I'd like to figure out the cause of it. I get the impression we're all bumbling about in the dark poking at a gigantic elephant and getting the wrong impressions.
I started making small changes to the code, but none of the changes helped much. I think the problem with the current code is that, even though idle CPUs update load, the fold only happens when one of the CPUs is busy and we end up taking its load into global load.
So, I tried to simplify things by doing the updates directly from the idle loop. This is only a test patch; eventually we need to hook it somewhere else instead of the idle loop, and it is expected to work only on x86_64 right now.
Peter: Do you think something like this will work? loadavg went quiet on two of my test systems after this change (4 cpu and 24 cpu).
+void idle_load_update(void);
 /*
  * The idle thread. There's no useful work to be
  * done, so just try to conserve power and have a
@@ -140,6 +141,7 @@ void cpu_idle(void)
 			stop_critical_timings();
 			pm_idle();
 			start_critical_timings();
+			idle_load_update();

 	if (nr_active != this_rq->calc_load_active) {
@@ -2974,46 +2974,6 @@ static long calc_load_fold_active(struct rq *this_rq)
 	return delta;
 }

-#ifdef CONFIG_NO_HZ
-/*
- * For NO_HZ we delay the active fold to the next LOAD_FREQ update.
- *
- * When making the ILB scale, we should try to pull this in as well.
- */
-static atomic_long_t calc_load_tasks_idle;
-
-static void calc_load_account_idle(struct rq *this_rq)
-{
-	long delta;
-
-	delta = calc_load_fold_active(this_rq);
-	if (delta)
-		atomic_long_add(delta, &calc_load_tasks_idle);
-}
-
-static long calc_load_fold_idle(void)
-{
-	long delta = 0;
-
-	/*
-	 * Its got a race, we don't care...
-	 */
-	if (atomic_long_read(&calc_load_tasks_idle))
-		delta = atomic_long_xchg(&calc_load_tasks_idle, 0);
-
-	return delta;
-}
-#else
-static void calc_load_account_idle(struct rq *this_rq)
-{
-}
-
-static inline long calc_load_fold_idle(void)
-{
-	return 0;
-}
-#endif
-
 /**
  * get_avenrun - get the load average array
  * @loads: pointer to dest load array
@@ -3043,7 +3003,7 @@ calc_load(unsigned long load, unsigned long exp, unsigned long active)
  */
 void calc_global_load(void)
 {
-	unsigned long upd = calc_load_update + 10;
+	unsigned long upd = calc_load_update + LOAD_FREQ/2;
 	long active;

 	if (time_before(jiffies, upd))
@@ -3063,21 +3023,30 @@ void calc_global_load(void)
  * Called from update_cpu_load() to periodically update this CPU's
  * active count.
  */
-static void calc_load_account_active(struct rq *this_rq)
+static void calc_load_account(struct rq *this_rq, int idle)
 {
 	long delta;

 	if (time_before(jiffies, this_rq->calc_load_update))
 		return;
Discussion subject changed to "High CPU load when machine is idle (related to PROBLEM: Unusually high load average when idle in 2.6.35, 2.6.35.1 and later)" by Venkatesh Pallipadi
On Fri, Oct 22, 2010 at 04:03:42PM -0700, Venkatesh Pallipadi wrote: > (Sorry about the subjectless earlier mail)
> I started making small changes to the code, but none of the change helped much. > I think the problem with the current code is that, even though idle CPUs > update load, the fold only happens when one of the CPU is busy > and we end up taking its load into global load.
> So, I tried to simplify things and doing the updates directly from idle loop. > This is only a test patch, and eventually we need to hook it off somewhere > else, instead of idle loop and also this is expected work only as x86_64 > right now.
> Peter: Do you think something like this will work? loadavg went > quite on two of my test systems after this change (4 cpu and 24 cpu).
> Thanks, > Venki
I'd really like to be able to help test this, but with a 32bit x86 machine I guess this patch won't do anything for me. How hard would it be to mangle this into working for my machine or should I just wait?
On Mon, Oct 25, 2010 at 3:12 AM, Peter Zijlstra <pet...@infradead.org> wrote: > On Fri, 2010-10-22 at 16:03 -0700, Venkatesh Pallipadi wrote: >> I started making small changes to the code, but none of the change helped much. >> I think the problem with the current code is that, even though idle CPUs >> update load, the fold only happens when one of the CPU is busy >> and we end up taking its load into global load.
>> So, I tried to simplify things and doing the updates directly from idle loop. >> This is only a test patch, and eventually we need to hook it off somewhere >> else, instead of idle loop and also this is expected work only as x86_64 >> right now.
>> Peter: Do you think something like this will work? loadavg went >> quite on two of my test systems after this change (4 cpu and 24 cpu).
> Not really, CPUs can stay idle for _very_ long times (!x86 cpus that > don't have crappy timers like HPET which roll around every 2-4 seconds).
> But all CPUs staying idle for a long time is exactly the scenario you > fix before using the decay_load_misses() stuff, except that is for the > load-balancer per-cpu load numbers not the global cpu load avg. Won't a > similar approach work here?
Yes. Thought about that. One problem is that it works through nohz_idle_balance, which will not be called if, for example, all the CPUs are idle. As this happens once every 5 secs, doing nr_running() and nr_uninterruptible() should probably be OK even on huge systems. But that was the original code here, except that it was inside xtime_lock.
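The "just sum everything once per 5 secs" alternative can be sketched in user space as below. This is not the kernel implementation: struct cpu_counts and count_active() are hypothetical names, and the real code would read per-cpu runqueue fields instead of a plain array.

```c
/* Hypothetical per-CPU snapshot of the two counters that feed loadavg. */
struct cpu_counts {
	long nr_running;
	long nr_uninterruptible;	/* may be negative on a single CPU */
};

/*
 * Sum runnable and uninterruptible tasks over all CPUs. A task can
 * block on one CPU and be woken on another, so only the global sum of
 * nr_uninterruptible is meaningful; clamp the result at zero in case
 * the snapshot was taken mid-update.
 */
long count_active(const struct cpu_counts *cpus, int ncpus)
{
	long sum = 0;
	int i;

	for (i = 0; i < ncpus; i++)
		sum += cpus[i].nr_running + cpus[i].nr_uninterruptible;

	return sum < 0 ? 0 : sum;
}
```

The appeal of this shape is that there is no per-CPU delta bookkeeping to get out of sync; the cost is one pass over all CPUs per sample, which is the trade-off raised above for huge systems.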