Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

nohz problem with idle time on old hardware

83 views
Skip to first unread message

Matthew Whitehead

unread,
Nov 13, 2013, 6:40:01 AM11/13/13
to
I was testing the 3.12 kernel on some _old_ hardware and I uncovered a bug.
It arises when nohz=on and goes away with nohz=off. On a crusty dual Pentium-1
system that is completely idle, the sar utility reports 0% idle time on cpu0
and 100% idle on cpu1. Cpu0 _should_ also be reporting 100% idle, but instead
it reports around 75% system time and 25% user time.

The problem was diagnosed by Steve Rostedt with help from John Stultz. The old
system declares the dual TSCs unstable, and backs down to a timesource of
refined-jiffies. Apparently refined-jiffies and jiffies are not a usable
timesourcefor nohz, but we don't check for that case because most modern
systems have several reliable hardware timesources.

John suggested that we turn off nohz unless a usable hardware timesource is
present.

- Matthew Whitehead <tedhe...@gmail.com>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Thomas Gleixner

unread,
Nov 13, 2013, 9:10:02 AM11/13/13
to
On Wed, 13 Nov 2013, Matthew Whitehead wrote:

> I was testing the 3.12 kernel on some _old_ hardware and I uncovered a bug.
> It arises when nohz=on and goes away with nohz=off. On a crusty dual Pentium-1
> system that is completely idle, the sar utility reports 0% idle time on cpu0
> and 100% idle on cpu1. Cpu0 _should_ also be reporting 100% idle, but instead
> it reports around 75% system time and 25% user time.
>
> The problem was diagnosed by Steve Rostedt with help from John
> Stultz. The old system declares the dual TSCs unstable, and backs
> down to a timesource of refined-jiffies. Apparently refined-jiffies
> and jiffies are not a usable timesourcefor nohz, but we don't check
> for that case because most modern systems have several reliable
> hardware timesources.

Wrong.

> John suggested that we turn off nohz unless a usable hardware timesource is
> present.

nohz already depends on two things:

1) A reliable clocksource which is valid for highres/nohz

2) A per cpu clockevent device which supports one shot mode.

and those are evaluated at runtime before we switch into NOHZ mode.

And neither jiffies nor refined-jiffies qualify as valid clocksource.

So there is something else wrong.

Thanks,

tglx

Steven Rostedt

unread,
Nov 13, 2013, 10:40:02 AM11/13/13
to
On Wed, 13 Nov 2013 10:21:53 -0500
Steven Rostedt <ros...@goodmis.org> wrote:
trace-cmd record -p function -l '*nohz*' -l account_process_tick -e sched_switch
>
> rcu_sche-9 0d... 6858.618033: sched_switch: rcu_sched:9 [120] S ==> swapper/0:0 [120]
> <idle>-0 0.... 6858.618082: function: tick_nohz_idle_enter <-- cpu_startup_entry
> <idle>-0 0d... 6858.618101: function: __tick_nohz_idle_enter <-- tick_nohz_idle_enter
> <idle>-0 0d.s. 6858.621499: function: tick_nohz_stop_idle <-- tick_check_idle
> <idle>-0 0d.h. 6858.621550: function: account_process_tick <-- update_process_times
> <idle>-0 0d... 6858.621769: function: tick_nohz_irq_exit <-- irq_exit
> <idle>-0 0d... 6858.621787: function: __tick_nohz_idle_enter <-- irq_exit
> <idle>-0 0d.s. 6858.625500: function: tick_nohz_stop_idle <-- tick_check_idle
> <idle>-0 0d.h. 6858.625574: function: account_process_tick <-- update_process_times
> <idle>-0 0d... 6858.625818: function: tick_nohz_irq_exit <-- irq_exit
> <idle>-0 0d... 6858.625847: function: __tick_nohz_idle_enter <-- irq_exit
> <idle>-0 0d.s. 6858.629295: function: tick_nohz_stop_idle <-- tick_check_idle
> <idle>-0 0d... 6858.629351: function: tick_nohz_irq_exit <-- irq_exit
> <idle>-0 0d... 6858.629373: function: __tick_nohz_idle_enter <-- irq_exit
> <idle>-0 0d.s. 6858.629503: function: tick_nohz_stop_idle <-- tick_check_idle
> <idle>-0 0d.h. 6858.629569: function: account_process_tick <-- update_process_times
> <idle>-0 0d... 6858.629851: function: tick_nohz_irq_exit <-- irq_exit
> <idle>-0 0d... 6858.629881: function: __tick_nohz_idle_enter <-- irq_exit
> <idle>-0 0d.s. 6858.630412: function: tick_nohz_stop_idle <-- tick_check_idle
> <idle>-0 0.N.. 6858.630550: function: tick_nohz_idle_exit <-- cpu_startup_entry
> <idle>-0 0d... 6858.630605: sched_switch: swapper/0:0 [120] R ==> rcu_sched:9 [120]
>
> I'm not saying that we are actually getting into nohz, but something
> with the nohz code is messing with cpu accounting.

The trace does indeed show that a tick is happening, as the config has
HZ=250 (4ms) and we see a tick happen every 4ms. But for some reason,
we don't update the the idle time correctly when nohz is enabled.

When I say nohz is enabled, I mean that we don't have nohz=off in the
command line. There seems to be some difference between having nohz=off
and having nohz disabled at runtime.

-- Steve

Thomas Gleixner

unread,
Nov 13, 2013, 11:00:01 AM11/13/13
to
On Wed, 13 Nov 2013, Steven Rostedt wrote:
> > I'm not saying that we are actually getting into nohz, but something
> > with the nohz code is messing with cpu accounting.
>
> The trace does indeed show that a tick is happening, as the config has
> HZ=250 (4ms) and we see a tick happen every 4ms. But for some reason,
> we don't update the the idle time correctly when nohz is enabled.
>
> When I say nohz is enabled, I mean that we don't have nohz=off in the
> command line. There seems to be some difference between having nohz=off
> and having nohz disabled at runtime.

Right that affects tick_nohz_enabled

Two files use this variable:
kernel/rcu/tree_plugin.h
kernel/time/tick-sched.c

The only accounting related stuff is in tick-sched.c:

get_cpu_idle_time_us() and get_cpu_iowait_time_us()

Both functions bail out if (!tick_nohz_enabled).

The users of get_cpu_idle_time_us() are cpufreq and fs/proc/stat.c!

Now the simplest fix is to let those functions check whether we
actually switched into NOHZ mode. Should work for the RCU tree stuff
as well.

Thanks,

tglx

Steven Rostedt

unread,
Nov 13, 2013, 11:00:02 AM11/13/13
to
On Wed, 13 Nov 2013 10:31:34 -0500
Steven Rostedt <ros...@goodmis.org> wrote:

> The trace does indeed show that a tick is happening, as the config has
> HZ=250 (4ms) and we see a tick happen every 4ms. But for some reason,
> we don't update the the idle time correctly when nohz is enabled.
>
> When I say nohz is enabled, I mean that we don't have nohz=off in the
> command line. There seems to be some difference between having nohz=off
> and having nohz disabled at runtime.


Looking at the differences between nohz=off from the command line, and
disabled at run time seems to be the variable "tick_nohz_enabled". I
don't see where it gets set to zero except for nohz=off.

$ git grep tick_nohz_enabled
kernel/rcutree.h: int tick_nohz_enabled_snap; /* Previously seen
value from sysfs. */ kernel/rcutree_plugin.h:extern int
tick_nohz_enabled; kernel/rcutree_plugin.h: tne =
ACCESS_ONCE(tick_nohz_enabled); kernel/rcutree_plugin.h: if
(tne != rdtp->tick_nohz_enabled_snap)
{ kernel/rcutree_plugin.h: rdtp->tick_nohz_enabled_snap
= tne; kernel/rcutree_plugin.h:
rdtp->tick_nohz_enabled_snap ? '.' : 'D'); kernel/time/tick-sched.c:int
tick_nohz_enabled __read_mostly = 1;
kernel/time/tick-sched.c: tick_nohz_enabled = 0;
kernel/time/tick-sched.c: tick_nohz_enabled = 1;
kernel/time/tick-sched.c: if (!tick_nohz_enabled)
kernel/time/tick-sched.c: if (!tick_nohz_enabled)
kernel/time/tick-sched.c: if (!tick_nohz_enabled)
kernel/time/tick-sched.c: if (tick_nohz_enabled)


What's even stranger is that the RCU code in rcutree_plugin.h does an
ACCESS_ONCE(tick_nohz_enabled) as if it can change.

That said, looking at the fs/proc/stat.c get_idle_time() it does an
idle_time = get_cpu_idle_time_us(cpu, NULL) which has:

u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
{
struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
ktime_t now, idle;

if (!tick_nohz_enabled)
return -1;

now = ktime_get();
if (last_update_time) {
update_ts_time_stats(cpu, ts, now, last_update_time);
idle = ts->idle_sleeptime;
} else {
if (ts->idle_active && !nr_iowait_cpu(cpu)) {
ktime_t delta = ktime_sub(now, ts->idle_entrytime);
idle = ktime_add(ts->idle_sleeptime, delta);
} else {
idle = ts->idle_sleeptime;
}
}

return ktime_to_us(idle);

}

This is one of the differences between nohz=off and nohz=on with jiffy
accounting. When we have nohz=off, this returns -1 and the calling
function calculates the idle time differently.

Thomas Gleixner

unread,
Nov 13, 2013, 11:10:02 AM11/13/13
to
On Wed, 13 Nov 2013, Steven Rostedt wrote:

> On Wed, 13 Nov 2013 10:31:34 -0500
> Steven Rostedt <ros...@goodmis.org> wrote:
>
> > The trace does indeed show that a tick is happening, as the config has
> > HZ=250 (4ms) and we see a tick happen every 4ms. But for some reason,
> > we don't update the the idle time correctly when nohz is enabled.
> >
> > When I say nohz is enabled, I mean that we don't have nohz=off in the
> > command line. There seems to be some difference between having nohz=off
> > and having nohz disabled at runtime.
>
>
> Looking at the differences between nohz=off from the command line, and
> disabled at run time seems to be the variable "tick_nohz_enabled". I
> don't see where it gets set to zero except for nohz=off.

Right. It's telling you if NOHZ is enabled. It's not telling you that
NOHZ is active.

Thanks,

tglx

Steven Rostedt

unread,
Nov 13, 2013, 11:20:02 AM11/13/13
to
On Wed, 13 Nov 2013 17:07:18 +0100 (CET)
Thomas Gleixner <tg...@linutronix.de> wrote:


> Right. It's telling you if NOHZ is enabled. It's not telling you that
> NOHZ is active.

Yeah, which makes this code rather silly:

in rcu_prepare_for_idle():

/* Handle nohz enablement switches conservatively. */
tne = ACCESS_ONCE(tick_nohz_enabled);
if (tne != rdtp->tick_nohz_enabled_snap) {
if (rcu_cpu_has_callbacks(cpu, NULL))
invoke_rcu_core(); /* force nohz to see update. */
rdtp->tick_nohz_enabled_snap = tne;
return;
}

-- Steve

Paul E. McKenney

unread,
Nov 13, 2013, 11:20:02 AM11/13/13
to
On Wed, Nov 13, 2013 at 04:50:20PM +0100, Thomas Gleixner wrote:
> On Wed, 13 Nov 2013, Steven Rostedt wrote:
> > > I'm not saying that we are actually getting into nohz, but something
> > > with the nohz code is messing with cpu accounting.
> >
> > The trace does indeed show that a tick is happening, as the config has
> > HZ=250 (4ms) and we see a tick happen every 4ms. But for some reason,
> > we don't update the the idle time correctly when nohz is enabled.
> >
> > When I say nohz is enabled, I mean that we don't have nohz=off in the
> > command line. There seems to be some difference between having nohz=off
> > and having nohz disabled at runtime.
>
> Right that affects tick_nohz_enabled
>
> Two files use this variable:
> kernel/rcu/tree_plugin.h
> kernel/time/tick-sched.c
>
> The only accounting related stuff is in tick-sched.c:
>
> get_cpu_idle_time_us() and get_cpu_iowait_time_us()
>
> Both functions bail out if (!tick_nohz_enabled).
>
> The users of get_cpu_idle_time_us() are cpufreq and fs/proc/stat.c!
>
> Now the simplest fix is to let those functions check whether we
> actually switched into NOHZ mode. Should work for the RCU tree stuff
> as well.

RCU's use of tick_nohz_enabled is for the RCU_FAST_NO_HZ stuff. If
it sees !tick_nohz_enabled, it skips trying to get RCU out of the way
of disabling the scheduling-clock tick. If RCU detects a change
in the value of tick_nohz_enabled, it does a raise_softirq() to
force re-evaluation of the situation.

Thanx, Paul

Paul E. McKenney

unread,
Nov 13, 2013, 11:20:02 AM11/13/13
to
On Wed, Nov 13, 2013 at 11:12:57AM -0500, Steven Rostedt wrote:
> On Wed, 13 Nov 2013 17:07:18 +0100 (CET)
> Thomas Gleixner <tg...@linutronix.de> wrote:
>
>
> > Right. It's telling you if NOHZ is enabled. It's not telling you that
> > NOHZ is active.
>
> Yeah, which makes this code rather silly:
>
> in rcu_prepare_for_idle():
>
> /* Handle nohz enablement switches conservatively. */
> tne = ACCESS_ONCE(tick_nohz_enabled);
> if (tne != rdtp->tick_nohz_enabled_snap) {
> if (rcu_cpu_has_callbacks(cpu, NULL))
> invoke_rcu_core(); /* force nohz to see update. */
> rdtp->tick_nohz_enabled_snap = tne;
> return;
> }

OK, what should I be checking instead? Not much point in trying to
get RCU out of the way of disabling the scheduling-clock interrupt
if NOHZ is disabled. ;-)

Thanx, Paul

Thomas Gleixner

unread,
Nov 13, 2013, 11:30:02 AM11/13/13
to
On Wed, 13 Nov 2013, Thomas Gleixner wrote:

> On Wed, 13 Nov 2013, Steven Rostedt wrote:
>
> > On Wed, 13 Nov 2013 10:31:34 -0500
> > Steven Rostedt <ros...@goodmis.org> wrote:
> >
> > > The trace does indeed show that a tick is happening, as the config has
> > > HZ=250 (4ms) and we see a tick happen every 4ms. But for some reason,
> > > we don't update the the idle time correctly when nohz is enabled.
> > >
> > > When I say nohz is enabled, I mean that we don't have nohz=off in the
> > > command line. There seems to be some difference between having nohz=off
> > > and having nohz disabled at runtime.
> >
> >
> > Looking at the differences between nohz=off from the command line, and
> > disabled at run time seems to be the variable "tick_nohz_enabled". I
> > don't see where it gets set to zero except for nohz=off.
>
> Right. It's telling you if NOHZ is enabled. It's not telling you that
> NOHZ is active.

And its even worse:

tick_nohz_idle_enter()

/*
* set ts->inidle unconditionally. even if the system did not
* switch to nohz mode the cpu frequency governers rely on the
* update of the idle time accounting in tick_nohz_start_idle().
*/
ts->inidle = 1;

So there is code relying on that accounting stuff even if nohz is not
active.

Frozen shark time ....

Steven Rostedt

unread,
Nov 13, 2013, 11:30:02 AM11/13/13
to
On Wed, 13 Nov 2013 08:18:29 -0800
"Paul E. McKenney" <pau...@linux.vnet.ibm.com> wrote:

> On Wed, Nov 13, 2013 at 11:12:57AM -0500, Steven Rostedt wrote:
> > On Wed, 13 Nov 2013 17:07:18 +0100 (CET)
> > Thomas Gleixner <tg...@linutronix.de> wrote:
> >
> >
> > > Right. It's telling you if NOHZ is enabled. It's not telling you that
> > > NOHZ is active.
> >
> > Yeah, which makes this code rather silly:
> >
> > in rcu_prepare_for_idle():
> >
> > /* Handle nohz enablement switches conservatively. */
> > tne = ACCESS_ONCE(tick_nohz_enabled);
> > if (tne != rdtp->tick_nohz_enabled_snap) {
> > if (rcu_cpu_has_callbacks(cpu, NULL))
> > invoke_rcu_core(); /* force nohz to see update. */
> > rdtp->tick_nohz_enabled_snap = tne;
> > return;
> > }
>
> OK, what should I be checking instead? Not much point in trying to
> get RCU out of the way of disabling the scheduling-clock interrupt
> if NOHZ is disabled. ;-)
>

I'll leave the answer to Thomas, but checking tick_nohz_enabled just
lets you know if someone booted with nohz=off or not (and has nohz
configured). But it doesn't tell you if nohz is actually being used.

That is, tick_nohz_enabled is set at bootup and never changes.

Perhaps this old hardware uncovered other bugs as well ;-)

-- Steve

Paul E. McKenney

unread,
Nov 13, 2013, 11:40:01 AM11/13/13
to
On Wed, Nov 13, 2013 at 11:23:38AM -0500, Steven Rostedt wrote:
> On Wed, 13 Nov 2013 08:18:29 -0800
> "Paul E. McKenney" <pau...@linux.vnet.ibm.com> wrote:
>
> > On Wed, Nov 13, 2013 at 11:12:57AM -0500, Steven Rostedt wrote:
> > > On Wed, 13 Nov 2013 17:07:18 +0100 (CET)
> > > Thomas Gleixner <tg...@linutronix.de> wrote:
> > >
> > >
> > > > Right. It's telling you if NOHZ is enabled. It's not telling you that
> > > > NOHZ is active.
> > >
> > > Yeah, which makes this code rather silly:
> > >
> > > in rcu_prepare_for_idle():
> > >
> > > /* Handle nohz enablement switches conservatively. */
> > > tne = ACCESS_ONCE(tick_nohz_enabled);
> > > if (tne != rdtp->tick_nohz_enabled_snap) {
> > > if (rcu_cpu_has_callbacks(cpu, NULL))
> > > invoke_rcu_core(); /* force nohz to see update. */
> > > rdtp->tick_nohz_enabled_snap = tne;
> > > return;
> > > }
> >
> > OK, what should I be checking instead? Not much point in trying to
> > get RCU out of the way of disabling the scheduling-clock interrupt
> > if NOHZ is disabled. ;-)
>
> I'll leave the answer to Thomas, but checking tick_nohz_enabled just
> lets you know if someone booted with nohz=off or not (and has nohz
> configured). But it doesn't tell you if nohz is actually being used.

Based on Thomas's most recent response, it sounds like I need to check
a frozen shark or something. ;-)

Thanx, Paul

Steven Rostedt

unread,
Nov 13, 2013, 11:40:02 AM11/13/13
to
On Wed, 13 Nov 2013 17:21:53 +0100 (CET)
Thomas Gleixner <tg...@linutronix.de> wrote:


> Frozen shark time ....

As long as it's not aimed at me ;-)

-- Steve

Thomas Gleixner

unread,
Nov 13, 2013, 3:10:01 PM11/13/13
to
Correct. Fix for the whole mess below.

Thanks,

tglx

---------------->
Subject: NOHZ: Check for nohz active instead of nohz enabled

RCU and the fine grained idle time accounting functions check
tick_nohz_enabled. But that variable is merily telling that NOHZ has
been enabled in the config and not been disabled on the command line.

But it does not tell anything about nohz being active. That's what all
this should check for.

Matthew reported, that the idle accounting on his old P1 machine
showed bogus values, when he enabled NOHZ in the config and did not
disable it on the kernel command line. The reason is that his machine
uses (refined) jiffies as a clocksource which explains why the "fine"
grained accounting went into lala land, because it depends on when the
system goes and leaves idle relative to the jiffies increment.

Provide a tick_nohz_active indicator and let RCU and the accounting
code use this instead of tick_nohz_enable.

Reported-by: Matthew Whitehead <tedhe...@gmail.com>
Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
---
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 3822ac0..da6e6de 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1632,7 +1632,7 @@ module_param(rcu_idle_gp_delay, int, 0644);
static int rcu_idle_lazy_gp_delay = RCU_IDLE_LAZY_GP_DELAY;
module_param(rcu_idle_lazy_gp_delay, int, 0644);

-extern int tick_nohz_enabled;
+extern int tick_nohz_active;

/*
* Try to advance callbacks for all flavors of RCU on the current CPU, but
@@ -1729,7 +1729,7 @@ static void rcu_prepare_for_idle(int cpu)
int tne;

/* Handle nohz enablement switches conservatively. */
- tne = ACCESS_ONCE(tick_nohz_enabled);
+ tne = ACCESS_ONCE(tick_nohz_active);
if (tne != rdtp->tick_nohz_enabled_snap) {
if (rcu_cpu_has_callbacks(cpu, NULL))
invoke_rcu_core(); /* force nohz to see update. */
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 3612fc7..a12df5a 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -361,8 +361,8 @@ void __init tick_nohz_init(void)
/*
* NO HZ enabled ?
*/
-int tick_nohz_enabled __read_mostly = 1;
-
+static int tick_nohz_enabled __read_mostly = 1;
+int tick_nohz_active __read_mostly;
/*
* Enable / Disable tickless mode
*/
@@ -465,7 +465,7 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
ktime_t now, idle;

- if (!tick_nohz_enabled)
+ if (!tick_nohz_active)
return -1;

now = ktime_get();
@@ -506,7 +506,7 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
ktime_t now, iowait;

- if (!tick_nohz_enabled)
+ if (!tick_nohz_active)
return -1;

now = ktime_get();
@@ -799,11 +799,6 @@ void tick_nohz_idle_enter(void)
local_irq_disable();

ts = &__get_cpu_var(tick_cpu_sched);
- /*
- * set ts->inidle unconditionally. even if the system did not
- * switch to nohz mode the cpu frequency governers rely on the
- * update of the idle time accounting in tick_nohz_start_idle().
- */
ts->inidle = 1;
__tick_nohz_idle_enter(ts);

@@ -973,7 +968,7 @@ static void tick_nohz_switch_to_nohz(void)
struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
ktime_t next;

- if (!tick_nohz_enabled)
+ if (!tick_nohz_active)
return;

local_irq_disable();
@@ -981,7 +976,7 @@ static void tick_nohz_switch_to_nohz(void)
local_irq_enable();
return;
}
-
+ tick_nohz_active = 1;
ts->nohz_mode = NOHZ_MODE_LOWRES;

/*
@@ -1139,8 +1134,10 @@ void tick_setup_sched_timer(void)
}

#ifdef CONFIG_NO_HZ_COMMON
- if (tick_nohz_enabled)
+ if (tick_nohz_enabled) {
ts->nohz_mode = NOHZ_MODE_HIGHRES;
+ tick_nohz_active = 1;
+ }
#endif
}
#endif /* HIGH_RES_TIMERS */

Steven Rostedt

unread,
Nov 13, 2013, 3:10:01 PM11/13/13
to
On Wed, 13 Nov 2013 21:01:57 +0100 (CET)
Thomas Gleixner <tg...@linutronix.de> wrote:


> ---------------->
> Subject: NOHZ: Check for nohz active instead of nohz enabled
>
> RCU and the fine grained idle time accounting functions check
> tick_nohz_enabled. But that variable is merily telling that NOHZ has
> been enabled in the config and not been disabled on the command line.
>
> But it does not tell anything about nohz being active. That's what all
> this should check for.
>
> Matthew reported, that the idle accounting on his old P1 machine
> showed bogus values, when he enabled NOHZ in the config and did not
> disable it on the kernel command line. The reason is that his machine
> uses (refined) jiffies as a clocksource which explains why the "fine"
> grained accounting went into lala land, because it depends on when the
> system goes and leaves idle relative to the jiffies increment.
>
> Provide a tick_nohz_active indicator and let RCU and the accounting
> code use this instead of tick_nohz_enable.
>
> Reported-by: Matthew Whitehead <tedhe...@gmail.com>

Matthew, can you test this patch to make sure it causes the idle issue
to go away (remember to remove nohz=off from your command line).

Reviewed-by: Steven Rostedt <ros...@goodmis.org>

-- Steve

Matthew Whitehead

unread,
Nov 13, 2013, 5:00:03 PM11/13/13
to
On Wed, Nov 13, 2013 at 03:07:45PM -0500, Steven Rostedt wrote:
> On Wed, 13 Nov 2013 21:01:57 +0100 (CET)
> Thomas Gleixner <tg...@linutronix.de> wrote:
>
>
> > ---------------->
> > Subject: NOHZ: Check for nohz active instead of nohz enabled
> >
> > RCU and the fine grained idle time accounting functions check
> > tick_nohz_enabled. But that variable is merily telling that NOHZ has
> > been enabled in the config and not been disabled on the command line.
> >
> > But it does not tell anything about nohz being active. That's what all
> > this should check for.
> >
> > Matthew reported, that the idle accounting on his old P1 machine
> > showed bogus values, when he enabled NOHZ in the config and did not
> > disable it on the kernel command line. The reason is that his machine
> > uses (refined) jiffies as a clocksource which explains why the "fine"
> > grained accounting went into lala land, because it depends on when the
> > system goes and leaves idle relative to the jiffies increment.
> >
> > Provide a tick_nohz_active indicator and let RCU and the accounting
> > code use this instead of tick_nohz_enable.
> >
> > Reported-by: Matthew Whitehead <tedhe...@gmail.com>
>
> Matthew, can you test this patch to make sure it causes the idle issue
> to go away (remember to remove nohz=off from your command line).
>

Early testing looks good:

# Verify I haven't overridden nohz as a kernel parameter
root@host:~# grep nohz /proc/cmdline

# See how my kernel chose its clocksource
root@host:~# dmesg | egrep "TSC|clocksource"
[ 0.000000] tsc: Fast TSC calibration using PIT
[ 2.492468] Switched to clocksource refined-jiffies
[ 24.197356] Switched to clocksource tsc
[ 26.211058] Switched to clocksource refined-jiffies

# Check idle time since the reboot
root@host:~# sar -s 16:28:00 -P ALL
Linux 3.12.0-rc6+ (host) 11/13/2013 _i586_ (2 CPU)

04:28:26 PM LINUX RESTART

04:32:02 PM CPU %user %nice %system %iowait %steal %idle
04:34:02 PM all 4.98 0.00 1.03 0.04 0.00 93.96
04:34:02 PM 0 4.07 0.00 0.69 0.03 0.00 95.21
04:34:02 PM 1 5.88 0.00 1.37 0.06 0.00 92.70
04:36:01 PM all 4.12 0.00 1.13 0.02 0.00 94.73
04:36:01 PM 0 1.98 0.00 1.34 0.02 0.00 96.66
04:36:01 PM 1 6.28 0.00 0.91 0.03 0.00 92.79
04:38:01 PM all 0.39 0.00 0.65 0.05 0.00 98.90
04:38:01 PM 0 0.33 0.00 0.54 0.03 0.00 99.09
04:38:01 PM 1 0.45 0.00 0.75 0.07 0.00 98.73
04:40:01 PM all 0.14 0.00 0.40 0.02 0.00 99.44
04:40:01 PM 0 0.12 0.00 0.26 0.01 0.00 99.62
04:40:01 PM 1 0.16 0.00 0.55 0.02 0.00 99.27
04:42:01 PM all 0.13 0.00 0.28 0.01 0.00 99.58
04:42:01 PM 0 0.09 0.00 0.14 0.01 0.00 99.76
04:42:01 PM 1 0.16 0.00 0.43 0.02 0.00 99.40
Average: all 1.95 0.00 0.70 0.03 0.00 97.33
Average: 0 1.32 0.00 0.59 0.02 0.00 98.07
Average: 1 2.58 0.00 0.80 0.04 0.00 96.58

I will continue to test but I think this is good.

Tested-by: Matthew Whitehead <tedhe...@gmail.com>

- Matthew W

Paul E. McKenney

unread,
Nov 18, 2013, 5:50:01 PM11/18/13
to
Reviewed-by: Paul E. McKenney <pau...@linux.vnet.ibm.com>

tip-bot for Thomas Gleixner

unread,
Nov 19, 2013, 1:10:02 PM11/19/13
to
Commit-ID: d689fe222a858c767cb8594faf280048e532b53f
Gitweb: http://git.kernel.org/tip/d689fe222a858c767cb8594faf280048e532b53f
Author: Thomas Gleixner <tg...@linutronix.de>
AuthorDate: Wed, 13 Nov 2013 21:01:57 +0100
Committer: Thomas Gleixner <tg...@linutronix.de>
CommitDate: Tue, 19 Nov 2013 14:59:50 +0100

NOHZ: Check for nohz active instead of nohz enabled

RCU and the fine grained idle time accounting functions check
tick_nohz_enabled. But that variable is merily telling that NOHZ has
been enabled in the config and not been disabled on the command line.

But it does not tell anything about nohz being active. That's what all
this should check for.

Matthew reported, that the idle accounting on his old P1 machine
showed bogus values, when he enabled NOHZ in the config and did not
disable it on the kernel command line. The reason is that his machine
uses (refined) jiffies as a clocksource which explains why the "fine"
grained accounting went into lala land, because it depends on when the
system goes and leaves idle relative to the jiffies increment.

Provide a tick_nohz_active indicator and let RCU and the accounting
code use this instead of tick_nohz_enable.

Reported-and-tested-by: Matthew Whitehead <tedhe...@gmail.com>
Signed-off-by: Thomas Gleixner <tg...@linutronix.de>
Reviewed-by: Steven Rostedt <ros...@goodmis.org>
Reviewed-by: Paul E. McKenney <pau...@linux.vnet.ibm.com>
Cc: john....@linaro.org
Cc: mwhi...@redhat.com
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1...@ionos.tec.linutronix.de
---
kernel/rcu/tree_plugin.h | 4 ++--
kernel/time/tick-sched.c | 21 +++++++++------------
2 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 6abb03d..08a7652 100644

Viresh Kumar

unread,
Apr 9, 2014, 10:00:04 AM4/9/14
to
On Thu, Nov 14, 2013 at 1:31 AM, Thomas Gleixner <tg...@linutronix.de> wrote:
> Subject: NOHZ: Check for nohz active instead of nohz enabled
>
> RCU and the fine grained idle time accounting functions check
> tick_nohz_enabled. But that variable is merily telling that NOHZ has
> been enabled in the config and not been disabled on the command line.
>
> But it does not tell anything about nohz being active. That's what all
> this should check for.
>
> Matthew reported, that the idle accounting on his old P1 machine
> showed bogus values, when he enabled NOHZ in the config and did not
> disable it on the kernel command line. The reason is that his machine
> uses (refined) jiffies as a clocksource which explains why the "fine"
> grained accounting went into lala land, because it depends on when the
> system goes and leaves idle relative to the jiffies increment.
>
> Provide a tick_nohz_active indicator and let RCU and the accounting
> code use this instead of tick_nohz_enable.

> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> @@ -973,7 +968,7 @@ static void tick_nohz_switch_to_nohz(void)
> struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
> ktime_t next;
>
> - if (!tick_nohz_enabled)
> + if (!tick_nohz_active)
> return;

Considering the impressive list of Reviewed-by and people involved
in this patch, I am not sure I am reading the code well here.

The above change isn't required as per my understanding. Otherwise
we will never pass that check. tick_nohz_active is initialized as zero
and so we will keep on returning for ever and wouldn't be able to set
it to 1 ever.

I have a patch to fix it up, but wanted to know your opinion before
sending it.

> local_irq_disable();
> @@ -981,7 +976,7 @@ static void tick_nohz_switch_to_nohz(void)
> local_irq_enable();
> return;
> }
> -
> + tick_nohz_active = 1;
> ts->nohz_mode = NOHZ_MODE_LOWRES;
>
> /*

Steven Rostedt

unread,
Apr 9, 2014, 10:40:01 AM4/9/14
to
Ouch! You are correct, this part of the patch makes no sense. That's
what I get for reviewing a patch and not looking at all the code around
the changes. (another kernel developer hangs head in shame :-( )

I think that if statement should be nuked.

-- Steve

Frederic Weisbecker

unread,
Apr 9, 2014, 11:20:01 AM4/9/14
to
Ouch, bad thing.

I'm also disovering this whole thread and patch only today, since nobody Cc'ed me :-(

Viresh Kumar

unread,
Apr 9, 2014, 11:30:02 AM4/9/14
to
On 9 April 2014 20:01, Steven Rostedt <ros...@goodmis.org> wrote:
> Ouch! You are correct, this part of the patch makes no sense. That's
> what I get for reviewing a patch and not looking at all the code around
> the changes. (another kernel developer hangs head in shame :-( )
>
> I think that if statement should be nuked.

Hmm, my opinion differs here :)

If we completely remove this statement, we will run
tick_nohz_switch_to_nohz() even if nohz is not enabled. And check for
enabled must stay.

Steven Rostedt

unread,
Apr 9, 2014, 11:30:03 AM4/9/14
to
On Wed, 9 Apr 2014 20:50:59 +0530
Viresh Kumar <viresh...@linaro.org> wrote:

> On 9 April 2014 20:01, Steven Rostedt <ros...@goodmis.org> wrote:
> > Ouch! You are correct, this part of the patch makes no sense. That's
> > what I get for reviewing a patch and not looking at all the code around
> > the changes. (another kernel developer hangs head in shame :-( )
> >
> > I think that if statement should be nuked.
>
> Hmm, my opinion differs here :)
>
> If we completely remove this statement, we will run
> tick_nohz_switch_to_nohz() even if nohz is not enabled. And check for
> enabled must stay.

Do we? This is only called by tick_check_oneshot_change() which has the
following:

int tick_check_oneshot_change(int allow_nohz)
{
struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);

if (!test_and_clear_bit(0, &ts->check_clocks))
return 0;

if (ts->nohz_mode != NOHZ_MODE_INACTIVE)
return 0;

if (!timekeeping_valid_for_hres() || !tick_is_oneshot_available())
return 0;

if (!allow_nohz)
return 1;

tick_nohz_switch_to_nohz();
return 0;
}

How often does it make it to that last check?

-- Steve

Steven Rostedt

unread,
Apr 9, 2014, 11:40:01 AM4/9/14
to
On Wed, 9 Apr 2014 11:31:43 -0400
Steven Rostedt <ros...@goodmis.org> wrote:

>
> Hmm, looking at the code, I see it probably should still do the check.
>
> OK, nevermind ;-)

Reading even more of the code, now I'm totally confused :-)

When tick_setup_sched_timer() is called, if tick_nohz_enabled is set,
then we set tick_nohz_active.

This gets called by hrtimer_switch_to_hres(), and before that is
called, the tick_check_oneshot_changed() will never get to the
tick_nohz_switch_to_nohz() call.

Looks to me, the real answer is to nuke both the if statement *and* the
setting of the tick_nohz_active in that function. Both looks a bit
redundant to me.

Viresh Kumar

unread,
Apr 9, 2014, 11:40:02 AM4/9/14
to
On 9 April 2014 21:01, Steven Rostedt <ros...@goodmis.org> wrote:
>> Do we? This is only called by tick_check_oneshot_change() which has the
>> following:
>>
>> int tick_check_oneshot_change(int allow_nohz)
>> {
>> struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
>>
>> if (!test_and_clear_bit(0, &ts->check_clocks))
>> return 0;
>>
>> if (ts->nohz_mode != NOHZ_MODE_INACTIVE)
>> return 0;
>>
>> if (!timekeeping_valid_for_hres() || !tick_is_oneshot_available())
>> return 0;
>>
>> if (!allow_nohz)
>> return 1;
>>
>> tick_nohz_switch_to_nohz();
>> return 0;
>> }
>>
>> How often does it make it to that last check?

Probably we will reach here only once per boot per cpu.

> Hmm, looking at the code, I see it probably should still do the check.

But still we need it for that one time. :)

Steven Rostedt

unread,
Apr 9, 2014, 11:40:01 AM4/9/14
to
Hmm, looking at the code, I see it probably should still do the check.

OK, nevermind ;-)

Viresh Kumar

unread,
Apr 9, 2014, 12:00:04 PM4/9/14
to
On 9 April 2014 21:09, Steven Rostedt <ros...@goodmis.org> wrote:
> Reading even more of the code, now I'm totally confused :-)

:)

> When tick_setup_sched_timer() is called, if tick_nohz_enabled is set,
> then we set tick_nohz_active.

correct.

> This gets called by hrtimer_switch_to_hres(), and before that is
> called, the tick_check_oneshot_changed() will never get to the
> tick_nohz_switch_to_nohz() call.

If hrtimer_switch_to_hres() is called or HRES is enabled, we will
never ever call tick_nohz_switch_to_nohz().

> Looks to me, the real answer is to nuke both the if statement *and* the
> setting of the tick_nohz_active in that function. Both looks a bit
> redundant to me.

When HRES isn't enabled and NOHZ isn't enabled as well, in that
case we should stick to the periodic code from tick-common.c and
the oneshot options of tick_nohz_switch_to_nohz() or
hrtimer_switch_to_hres() shouldn't be used. And so, we still need
those checks, as per my understanding. :)

Lets see what others have for this discussion :)

Steven Rostedt

unread,
Apr 9, 2014, 12:20:02 PM4/9/14
to
On Wed, 9 Apr 2014 21:26:51 +0530
Viresh Kumar <viresh...@linaro.org> wrote:


> When HRES isn't enabled and NOHZ isn't enabled as well, in that
> case we should stick to the periodic code from tick-common.c and
> the oneshot options of tick_nohz_switch_to_nohz() or
> hrtimer_switch_to_hres() shouldn't be used. And so, we still need
> those checks, as per my understanding. :)

The one thing we can agree on is that the current code is wrong :-)

>
> Lets see what others have for this discussion :)

Yeah, those that actually wrote this code should have a say in this.

-- Steve
0 new messages