
[PATCH] watchdog: Make sure the watchdog thread gets CPU on loaded system


Don Zickus

Mar 14, 2012, 4:40:03 PM
From: Michal Hocko <mho...@suse.cz>

If the system is loaded while hotplugging a CPU we might end up with a bogus
hardlockup detection. This has been seen during LTP pounder test executed
in parallel with hotplug test.

The main problem is that enable_watchdog (called when CPU is brought up)
registers perf event which periodically checks per-cpu counter
(hrtimer_interrupts), updated from a hrtimer callback, but the hrtimer is fired
from the kernel thread.

This means that while we already do check for the hard lockup the kernel thread
might be sitting on the runqueue with zillions of tasks so there is nobody to
update the value we rely on and so we KABOOM.

Let's fix this by boosting the watchdog thread priority before we wake it up
rather than when it's already running.
This still doesn't handle a case where we have the same amount of high prio
FIFO tasks but that doesn't seem to be common. The current implementation
doesn't handle that case anyway so this is not worse at least.

Unfortunately, we cannot start perf counter from the watchdog thread because we
could miss a real lock up and also we cannot start the hrtimer watchdog_enable
because we there is no way (at least I don't know any) to start a hrtimer from
a different CPU.

[fix compile issue with param -dcz]

Cc: Ingo Molnar <mi...@elte.hu>
Cc: Peter Zijlstra <a.p.zi...@chello.nl>
Cc: Andrew Morton <ak...@linux-foundation.org>
Cc: Mandeep Singh Baines <m...@chromium.org>
Signed-off-by: Michal Hocko <mho...@suse.cz>
Signed-off-by: Don Zickus <dzi...@redhat.com>
---
kernel/watchdog.c | 7 +++----
1 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index d117262..6618cde 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -321,11 +321,9 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
  */
 static int watchdog(void *unused)
 {
-        struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
+        struct sched_param param = { .sched_priority = 0 };
         struct hrtimer *hrtimer = &__raw_get_cpu_var(watchdog_hrtimer);
 
-        sched_setscheduler(current, SCHED_FIFO, &param);
-
         /* initialize timestamp */
         __touch_watchdog();
 
@@ -350,7 +348,6 @@ static int watchdog(void *unused)
                 set_current_state(TASK_INTERRUPTIBLE);
         }
         __set_current_state(TASK_RUNNING);
-        param.sched_priority = 0;
         sched_setscheduler(current, SCHED_NORMAL, &param);
         return 0;
 }
@@ -439,6 +436,7 @@ static int watchdog_enable(int cpu)
 
         /* create the watchdog thread */
         if (!p) {
+                struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
                 p = kthread_create_on_node(watchdog, NULL, cpu_to_node(cpu), "watchdog/%d", cpu);
                 if (IS_ERR(p)) {
                         printk(KERN_ERR "softlockup watchdog for %i failed\n", cpu);
@@ -450,6 +448,7 @@ static int watchdog_enable(int cpu)
                         }
                         goto out;
                 }
+                sched_setscheduler(p, SCHED_FIFO, &param);
                 kthread_bind(p, cpu);
                 per_cpu(watchdog_touch_ts, cpu) = 0;
                 per_cpu(softlockup_watchdog, cpu) = p;
--
1.7.7.6


Mandeep Singh Baines

Mar 14, 2012, 5:10:01 PM
Reviewed-by: Mandeep Singh Baines <m...@chromium.org>

Andrew Morton

Mar 14, 2012, 7:20:03 PM
On Wed, 14 Mar 2012 16:38:45 -0400
Don Zickus <dzi...@redhat.com> wrote:

> From: Michal Hocko <mho...@suse.cz>

This changelog is awful.

> If the system is loaded while hotplugging a CPU we might end up with a bogus
> hardlockup detection. This has been seen during LTP pounder test executed
> in parallel with hotplug test.
>
> The main problem is that enable_watchdog (called when CPU is brought up)

You mean watchdog_enable().

> registers perf event which periodically checks per-cpu counter
> (hrtimer_interrupts), updated from a hrtimer callback, but the hrtimer is fired

s/fired/started/

> from the kernel thread.

"the kernel thread" being kernel/watchdog.c:watchdog()

> This means that while we already do check for the hard lockup the kernel thread

Who is "we" and where in the kernel does this check occur?

"the kernel thread" is still kernel/watchdog.c:watchdog().

> might be sitting on the runqueue with zillions of tasks

What causes these "zillions of tasks"? Are they userspace tasks?
They're preventing the watchdog() function from being called in a
timely fashion, I assume?

> so there is nobody to
> update the value we rely on and so we KABOOM.

Who is "we" and what is "the value"?

etcetera. It is maddeningly inaccurate, vague and handwavy for someone
who is actually trying to understand what you're trying to tell us.

> Let's fix this by boosting the watchdog thread priority before we wake it up
> rather than when it's already running.
> This still doesn't handle a case where we have the same amount of high prio
> FIFO tasks but that doesn't seem to be common.

Even a single FIFO thread could starve the watchdog() thread.

> The current implementation
> doesn't handle that case anyway so this is not worse at least.

Right. But this isn't specific to the startup case, is it? A spinning
SCHED_FIFO thread could cause watchdog() to get starved of CPU for an
arbitrarily long time, triggering a false(?) lockup detection? Or did
we do something to prevent that case? I assume we did - it would be
pretty bad if this were to happen.
Why did watchdog() reset the scheduling policy seven instructions
before exiting? Seems pointless.

> @@ -439,6 +436,7 @@ static int watchdog_enable(int cpu)
>
> /* create the watchdog thread */
> if (!p) {
> + struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
> p = kthread_create_on_node(watchdog, NULL, cpu_to_node(cpu), "watchdog/%d", cpu);
> if (IS_ERR(p)) {
> printk(KERN_ERR "softlockup watchdog for %i failed\n", cpu);
> @@ -450,6 +448,7 @@ static int watchdog_enable(int cpu)
> }
> goto out;
> }
> + sched_setscheduler(p, SCHED_FIFO, &param);
> kthread_bind(p, cpu);
> per_cpu(watchdog_touch_ts, cpu) = 0;
> per_cpu(softlockup_watchdog, cpu) = p;

It's pretty silly that kthread_create_on_node() sets the scheduling
policy and priority and then the caller immediately resets it. There
should be a version of kthread_create_on_node() which takes these as
arguments.

Oh well, despite all that the patch looks OK to me, after using
whiteout all over the changelog.

Mandeep Singh Baines

Mar 14, 2012, 9:50:02 PM
My paraphrasing:

Set the task priority of the watchdog thread during creation. The current
implementation sets the priority in one of the first few instructions
executed in the context of the watchdog thread. A false lockup can be
detected because the watchdog thread is not yet at MAX_RT_PRIO - 1, so it
can be kept off the CPU by a long runqueue or a running SCHED_FIFO process.
Once it changes its priority, this is no longer the case. The fix is to
set the priority to MAX_RT_PRIO - 1 at creation time instead of at runtime.


> > Let's fix this by boosting the watchdog thread priority before we wake it up
> > rather than when it's already running.
> > This still doesn't handle a case where we have the same amount of high prio
> > FIFO tasks but that doesn't seem to be common.
>
> Even a single FIFO thread could starve the watchdog() thread.
>
> > The current implementation
> > doesn't handle that case anyway so this is not worse at least.
>
> Right. But this isn't specific to the startup case, is it? A spinning
> SCHED_FIFO thread could cause watchdog() to get starved of CPU for an
> arbitrarily long time, triggering a false(?) lockup detection? Or did
> we do something to prevent that case? I assume we did - it would be
> pretty bad if this were to happen.
>

I don't think anything prevents a SCHED_FIFO task from causing a false
lockup.

From sched.h:

/*
 * Priority of a process goes from 0..MAX_PRIO-1, valid RT
 * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
 * tasks are in the range MAX_RT_PRIO..MAX_PRIO-1. Priority
 * values are inverted: lower p->prio value means higher priority.
 *
 * The MAX_USER_RT_PRIO value allows the actual maximum
 * RT priority to be separate from the value exported to
 * user-space. This allows kernel threads to set their
 * priority to a value higher than any user task. Note:
 * MAX_RT_PRIO must not be smaller than MAX_USER_RT_PRIO.
 */

#define MAX_USER_RT_PRIO        100
#define MAX_RT_PRIO             MAX_USER_RT_PRIO

You could make MAX_RT_PRIO greater than MAX_USER_RT_PRIO but that might
have some impact on real-time applications. A simple one-line patch:

- #define MAX_RT_PRIO MAX_USER_RT_PRIO
+ #define MAX_RT_PRIO (MAX_USER_RT_PRIO + 1)

would prevent user-space from causing a false lockup detection.
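As an illustration of the effect (hypothetical code, not part of any posted
patch; the helper and its name are made up, only sched_setscheduler() and
the priority macros are real):

/*
 * With MAX_RT_PRIO == MAX_USER_RT_PRIO + 1, a kernel thread can claim a
 * FIFO priority that user space cannot reach, since sched_setscheduler(2)
 * limits user-supplied SCHED_FIFO priorities to 1..MAX_USER_RT_PRIO-1.
 */
static void boost_watchdog_above_userspace(struct task_struct *p)
{
        struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };

        /* MAX_RT_PRIO - 1 == MAX_USER_RT_PRIO: above any user task */
        sched_setscheduler(p, SCHED_FIFO, &param);
}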

Regards,
Mandeep

Michal Hocko

Mar 15, 2012, 4:10:02 AM
On Wed 14-03-12 16:19:06, Andrew Morton wrote:
> On Wed, 14 Mar 2012 16:38:45 -0400
> Don Zickus <dzi...@redhat.com> wrote:
>
> > From: Michal Hocko <mho...@suse.cz>
>
> This changelog is awful.

Sorry about that. What about this?

If the system is heavy loaded while hotplugging a CPU we might end up
with a bogus hardlockup detection. This has been seen during LTP pounder
test executed in parallel with hotplug test.

Hard lockup detector consist of two parts
- watchdog_overflow_callback (executed as a perf counter callback
from NMI) which checks whether per-cpu hrtimer_interrupts changed
since the last time it run and panics if not
- watchdog kernel thread which starts watchdog_hrtimer which
periodically updates hrtimer_interrupts.

The main problem is that watchdog_enable (called when CPU is brought up)
registers perf event but the hrtimer is started later when the watchdog
thread gets a chance to run.
The watchdog thread starts with a normal priority currently and boosts
itself as soon as it gets to a CPU. This might be, however, already too
late as demonstrated with the LTP pounder test executed in parallel with
LTP hotplug test. There are zillions of userspace processes sitting in
the runque in this workload while the number of CPUs gets down to 1 and
then they are onlined back to the original count.
When we online a CPU and create the watchdog kernel thread it will take
some time until it gets to a CPU. On the other hand the perf counter
callback is executed in the timely fashion so we explode the first time
it finds out there were no changes in the counter.

Let's fix this by boosting the watchdog thread priority before we wake it up
rather than when it's already running.
This still doesn't handle a case where we have the same amount of high prio
FIFO tasks but that doesn't seem to be common. The current implementation
doesn't handle that case anyway so this is not worse at least.

Unfortunately, we cannot start perf counter from the watchdog thread because we
could miss a real lock up and also we cannot start the hrtimer watchdog_enable
because we there is no way (at least I don't know any) to start a hrtimer from
a different CPU.

[...]
> > Let's fix this by boosting the watchdog thread priority before we wake it up
> > rather than when it's already running.
> > This still doesn't handle a case where we have the same amount of high prio
> > FIFO tasks but that doesn't seem to be common.
>
> Even a single FIFO thread could starve the watchdog() thread.

Only if preemption is off, I guess...

> Why did watchdog() reset the scheduling policy seven instructions
> before exiting? Seems pointless.

It has been introduced by Thomas in cba9bd22. To be honest I don't
understand why it makes a sense?
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

Peter Zijlstra

Mar 15, 2012, 7:10:01 AM
On Thu, 2012-03-15 at 12:00 +0100, Peter Zijlstra wrote:
> On Wed, 2012-03-14 at 18:45 -0700, Mandeep Singh Baines wrote:
> > You could make MAX_RT_PRIO greater than MAX_USER_RT_PRIO but that
> > might
> > have some impact on real-time applications. A simple one-line patch:
> >
> > - #define MAX_RT_PRIO MAX_USER_RT_PRIO
> > + #define MAX_RT_PRIO (MAX_USER_RT_PRIO + 1)
> >
> > would prevent user-space from causing a false lockup detection.
>
> We're so not going to muck with the fifo priorities just for this stupid
> soft watchdog,.. I already hate that I can't disable the piece of crap,
> making it more involved is just really not going to happen.

And before people start to whinge about that, all the soft watchdog
issues I've seen fly by the past year or so all were bugs in the
watchdog itself, I can't actually remember it flagging a real problem.

The NMI watchdog otoh works like a charm for me and regularly helps out
when I done stupid.

Peter Zijlstra

Mar 15, 2012, 7:10:02 AM
On Wed, 2012-03-14 at 18:45 -0700, Mandeep Singh Baines wrote:
> You could make MAX_RT_PRIO greater than MAX_USER_RT_PRIO but that
> might
> have some impact on real-time applications. A simple one-line patch:
>
> - #define MAX_RT_PRIO MAX_USER_RT_PRIO
> + #define MAX_RT_PRIO (MAX_USER_RT_PRIO + 1)
>
> would prevent user-space from causing a false lockup detection.

We're so not going to muck with the fifo priorities just for this stupid
soft watchdog,.. I already hate that I can't disable the piece of crap,
making it more involved is just really not going to happen.


Ingo Molnar

Mar 15, 2012, 8:50:02 AM

* Peter Zijlstra <pet...@infradead.org> wrote:

> On Thu, 2012-03-15 at 12:00 +0100, Peter Zijlstra wrote:
> > On Wed, 2012-03-14 at 18:45 -0700, Mandeep Singh Baines wrote:
> > > You could make MAX_RT_PRIO greater than MAX_USER_RT_PRIO but that
> > > might
> > > have some impact on real-time applications. A simple one-line patch:
> > >
> > > - #define MAX_RT_PRIO MAX_USER_RT_PRIO
> > > + #define MAX_RT_PRIO (MAX_USER_RT_PRIO + 1)
> > >
> > > would prevent user-space from causing a false lockup detection.
> >
> > We're so not going to muck with the fifo priorities just for this stupid
> > soft watchdog,.. I already hate that I can't disable the piece of crap,
> > making it more involved is just really not going to happen.
>
> And before people start to whinge about that, all the soft
> watchdog issues I've seen fly by the past year or so all were
> bugs in the watchdog itself, I can't actually remember it
> flagging a real problem.

Its efficiency always depended on which area I was working on.
For syscall level stuff it helped me numerous times.

> The NMI watchdog otoh works like a charm for me and regularly
> helps out when I done stupid.

Sure, you are mostly working on perf events, the scheduler and
related core kernel areas so when you are stupid you get a hard
lockup or worse, quickly. Not much room for soft lockups.

So it's more of a case of selection bias, me thinks.

So unless there's concensus to remove everything but the hard
lockup detection facilities, lets solve the technical problem at
hand, ok?

Thanks,

Ingo

Peter Zijlstra

Mar 15, 2012, 10:10:02 AM
On Thu, 2012-03-15 at 13:42 +0100, Ingo Molnar wrote:
> So unless there's concensus to remove everything but the hard
> lockup detection facilities, lets solve the technical problem at
> hand, ok?

Well, at least make it possible to disable the silly soft thing.

And I really wouldn't know how the soft thing could possible help,
except when not actually having a NMI watchdog. What case does it
trigger where the NMI one doesn't?

Don Zickus

Mar 15, 2012, 10:40:01 AM
On Thu, Mar 15, 2012 at 03:00:51PM +0100, Peter Zijlstra wrote:
> On Thu, 2012-03-15 at 13:42 +0100, Ingo Molnar wrote:
> > So unless there's concensus to remove everything but the hard
> > lockup detection facilities, lets solve the technical problem at
> > hand, ok?
>
> Well, at least make it possible to disable the silly soft thing.
>
> And I really wouldn't know how the soft thing could possible help,
> except when not actually having a NMI watchdog. What case does it
> trigger where the NMI one doesn't?

I think softlockup really boils down to a pre-emption disabled detector
much like how the hardlockup really is an interrupts disabled detector.

The amount of code preventing the scheduler from running is most likely a
lot lower than the code that prevents interrupts from happening.
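As a rough illustration (assumed test code, not from this thread), the kind
of loop each detector is meant to catch:

/* Illustration only -- do not run this on a machine you care about. */
static void provoke_softlockup(void)
{
        preempt_disable();      /* the scheduler can't run watchdog/N here */
        for (;;)
                cpu_relax();    /* timer IRQs still fire: soft lockup only */
}

static void provoke_hardlockup(void)
{
        local_irq_disable();    /* the watchdog hrtimer can't fire either */
        for (;;)
                cpu_relax();    /* only the NMI hard lockup detector fires */
}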

Cheers,
Don

Mandeep Singh Baines

Mar 15, 2012, 11:40:02 AM
Don Zickus (dzi...@redhat.com) wrote:
> On Thu, Mar 15, 2012 at 03:00:51PM +0100, Peter Zijlstra wrote:
> > On Thu, 2012-03-15 at 13:42 +0100, Ingo Molnar wrote:
> > > So unless there's concensus to remove everything but the hard
> > > lockup detection facilities, lets solve the technical problem at
> > > hand, ok?
> >
> > Well, at least make it possible to disable the silly soft thing.
> >
> > And I really wouldn't know how the soft thing could possible help,
> > except when not actually having a NMI watchdog. What case does it
> > trigger where the NMI one doesn't?
>
> I think softlockup really boils down to a pre-emption disabled detector
> much like how the hardlockup really is an interrupts disabled detector.
>
> The amount of code preventing the scheduler from running is most likely a
> lot lower than the code that prevents interrupts from happening.
>

Its a good tool for catching problems of scale. As we move to more and
more cores you'll uncover bugs where data structures start to blow up.
Hash tables get huge, when you have 100000s of processes or millions of
TCP flows, or cgroups or namespace. That critical section (spinlock,
spinlock_bh, or preempt_disable) that used to be OK might no longer be.

There are some labs that are already there:

http://www.wine-reviews.net/wine-reviews/news/megatux-to-run-1-million-copies-on-wine-to-simulate-the-internet.html

With the softlockup detector, they'll get a useful stack trace in dmesg
that they can then send to lkml so that we can fix a scalability issue
that we hadn't previously known about.

Regards,
Mandeep

Don Zickus

Mar 15, 2012, 12:00:03 PM
On Thu, Mar 15, 2012 at 09:02:32AM +0100, Michal Hocko wrote:
> On Wed 14-03-12 16:19:06, Andrew Morton wrote:
> > On Wed, 14 Mar 2012 16:38:45 -0400
> > Don Zickus <dzi...@redhat.com> wrote:
> >
> > > From: Michal Hocko <mho...@suse.cz>
> >
> > This changelog is awful.

My apologies too, Andrew for not being more diligent.

Some nitpicks below (hopefully it isn't too picky :-( )

>
> Sorry about that, What about this?
>
> If the system is heavy loaded while hotplugging a CPU we might end up
heavily ,

> with a bogus hardlockup detection. This has been seen during LTP pounder
the
> test executed in parallel with hotplug test.
the

>
> Hard lockup detector consist of two parts
> - watchdog_overflow_callback (executed as a perf counter callback
> from NMI) which checks whether per-cpu hrtimer_interrupts changed
> since the last time it run and panics if not
> - watchdog kernel thread which starts watchdog_hrtimer which
> periodically updates hrtimer_interrupts.
>
> The main problem is that watchdog_enable (called when CPU is brought up)
a

> registers perf event but the hrtimer is started later when the watchdog
a

> thread gets a chance to run.

<perhaps an empty line here or merge the paragraphs to make it easier
to read?>

> The watchdog thread starts with a normal priority currently and boosts
^^^^^^^(remove?)

> itself as soon as it gets to a CPU. This might be, however, already too
^^^ replace with 'runs on'

> late as demonstrated with the LTP pounder test executed in parallel with
^^^^ replace with 'by'

> LTP hotplug test. There are zillions of userspace processes sitting in
> the runque in this workload while the number of CPUs gets down to 1 and
> then they are onlined back to the original count.
sounds awkward, how about
"while the number of active CPUS (after
soft-unplugging) is 1. Then all the cpus are soft-plugged back online."

> When we online a CPU and create the watchdog kernel thread it will take
> some time until it gets to a CPU. On the other hand the perf counter
> callback is executed in the timely fashion so we explode the first time
> it finds out there were no changes in the counter.
^^^^^ perhaps "it finds out hrtimer_interrupts was not incremented"

>
> Let's fix this by boosting the watchdog thread priority before we wake it up
> rather than when it's already running.
> This still doesn't handle a case where we have the same amount of high prio
> FIFO tasks but that doesn't seem to be common. The current implementation
> doesn't handle that case anyway so this is not worse at least.
^^^^ "is no worse."??

>
> Unfortunately, we cannot start perf counter from the watchdog thread because we
> could miss a real lock up and also we cannot start the hrtimer watchdog_enable
^ from

> because we there is no way (at least I don't know any) to start a hrtimer from
s/we//

> a different CPU.
>
> [...]
> > > Let's fix this by boosting the watchdog thread priority before we wake it up
> > > rather than when it's already running.
> > > This still doesn't handle a case where we have the same amount of high prio
> > > FIFO tasks but that doesn't seem to be common.
> >
> > Even a single FIFO thread could starve the watchdog() thread.
>
> Only if preemption is off, I guess...

I was going to suggest that this is a good case for touch_softlockup(), but
if the thread is in userspace that won't work.

>
> > > The current implementation
> > > doesn't handle that case anyway so this is not worse at least.
> >
> > Right. But this isn't specific to the startup case, is it? A spinning
> > SCHED_FIFO thread could cause watchdog() to get starved of CPU for an
> > arbitrarily long time, triggering a false(?) lockup detection? Or did
> > we do something to prevent that case? I assume we did - it would be
> > pretty bad if this were to happen.

Well, either the thread should use touch_softlockup() (if possible) or we
need to have a higher priority for the softlockup thread to prevent
userspace from blocking it.

> > Why did watchdog() reset the scheduling policy seven instructions
> > before exiting? Seems pointless.
>
> It has been introduced by Thomas in cba9bd22. To be honest I don't
> understand why it makes a sense?

Yeah I noticed that too. I didn't bother questioning it either when it
went in. I just assumed Thomas and Peter know scheduling a lot better
than I do. :-)

Cheers,
Don

Peter Zijlstra

Mar 15, 2012, 12:10:02 PM
On Thu, 2012-03-15 at 11:54 -0400, Don Zickus wrote:
> > > Why did watchdog() reset the scheduling policy seven instructions
> > > before exiting? Seems pointless.
> >
> > It has been introduced by Thomas in cba9bd22. To be honest I don't
> > understand why it makes a sense?
>
> Yeah I noticed that too. I didn't bother questioning it either when
> it
> went in. I just assumed Thomas and Peter know scheduling a lot better
> than I do. :-)

I just dug through my IRC logs and the reason is that running the entire
exit path as a highest priority RT task incurs a latency spike to other
tasks running on the system (313us). Since there's absolutely no point
in running the exit path as a RT task, tglx made it not do that.

Peter Zijlstra

Mar 15, 2012, 12:20:01 PM
On Thu, 2012-03-15 at 17:10 +0100, Peter Zijlstra wrote:
> On Thu, 2012-03-15 at 08:39 -0700, Mandeep Singh Baines wrote:
> > Its a good tool for catching problems of scale. As we move to more and
> > more cores you'll uncover bugs where data structures start to blow up.
> > Hash tables get huge, when you have 100000s of processes or millions
> > of
> > TCP flows, or cgroups or namespace. That critical section (spinlock,
> > spinlock_bh, or preempt_disable) that used to be OK might no longer
> > be.
>
> Or you run with the preempt latency tracer.

Or for that matter run cyclictest...

Peter Zijlstra

Mar 15, 2012, 12:20:01 PM
On Thu, 2012-03-15 at 08:39 -0700, Mandeep Singh Baines wrote:
> Its a good tool for catching problems of scale. As we move to more and
> more cores you'll uncover bugs where data structures start to blow up.
> Hash tables get huge, when you have 100000s of processes or millions
> of
> TCP flows, or cgroups or namespace. That critical section (spinlock,
> spinlock_bh, or preempt_disable) that used to be OK might no longer
> be.

Or you run with the preempt latency tracer.

Peter Zijlstra

Mar 15, 2012, 12:20:02 PM
On Thu, 2012-03-15 at 17:11 +0100, Peter Zijlstra wrote:
> On Thu, 2012-03-15 at 17:10 +0100, Peter Zijlstra wrote:
> > On Thu, 2012-03-15 at 08:39 -0700, Mandeep Singh Baines wrote:
> > > Its a good tool for catching problems of scale. As we move to more and
> > > more cores you'll uncover bugs where data structures start to blow up.
> > > Hash tables get huge, when you have 100000s of processes or millions
> > > of
> > > TCP flows, or cgroups or namespace. That critical section (spinlock,
> > > spinlock_bh, or preempt_disable) that used to be OK might no longer
> > > be.
> >
> > Or you run with the preempt latency tracer.
>
> Or for that matter run cyclictest...

Thing is, if you want a latency detector, call it that and stop
pretending its a useful debug feature. Also, if you want that, set the
interval in the 0.1-0.5 seconds range and dump stack on every new max.

Michal Hocko

Mar 15, 2012, 12:20:02 PM
On Thu 15-03-12 11:54:13, Don Zickus wrote:
> On Thu, Mar 15, 2012 at 09:02:32AM +0100, Michal Hocko wrote:
> > On Wed 14-03-12 16:19:06, Andrew Morton wrote:
> > > On Wed, 14 Mar 2012 16:38:45 -0400
> > > Don Zickus <dzi...@redhat.com> wrote:
> > >
> > > > From: Michal Hocko <mho...@suse.cz>
> > >
> > > This changelog is awful.
>
> My apologies too, Andrew for not being more diligent.
>
> Some nitpicks below (hopefully it isn't too picky :-( )

Thanks! Updated
---
From a8da58750ba78d737136a4df24af805cb936ee00 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mho...@suse.cz>
Date: Tue, 13 Mar 2012 10:34:44 +0100
Subject: [PATCH] watchdog: make sure the watchdog thread gets CPU on loaded
system

If the system is heavily loaded while hotplugging a CPU, we might end up
with a bogus hardlockup detection. This has been seen during LTP pounder
test executed in parallel with the hotplug test.

The hard lockup detector consists of two parts:
- watchdog_overflow_callback (executed as a perf counter callback
from NMI) which checks whether per-cpu hrtimer_interrupts changed
since the last time it ran and panics if not
- watchdog kernel thread which starts watchdog_hrtimer which
periodically updates hrtimer_interrupts.

The main problem is that watchdog_enable (called when a CPU is brought up)
registers a perf event but the hrtimer is started later when the watchdog
thread gets a chance to run.
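For orientation, a minimal sketch of how those two parts interact (heavily
simplified; the real code in kernel/watchdog.c also handles soft lockups,
sysctls and the counter bookkeeping differently):

/* Sketch only -- not the literal kernel code. */
static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts);
static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts_saved);

/* Part 2: hrtimer callback, armed by the watchdog/N kernel thread. */
static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
{
        __this_cpu_inc(hrtimer_interrupts);     /* proof of life for the NMI side */
        hrtimer_forward_now(hrtimer, ns_to_ktime(get_sample_period()));
        return HRTIMER_RESTART;
}

/* Part 1: perf NMI callback; keeps firing even if the kthread never ran. */
static void watchdog_overflow_callback(void)   /* real signature elided */
{
        unsigned long hrint = __this_cpu_read(hrtimer_interrupts);

        if (hrint == __this_cpu_read(hrtimer_interrupts_saved))
                panic("Watchdog detected hard LOCKUP on cpu %d",
                      smp_processor_id());

        __this_cpu_write(hrtimer_interrupts_saved, hrint);
}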

The watchdog thread currently starts with a normal priority and boosts
itself as soon as it runs on a CPU. This might, however, already be too
late, as demonstrated by the LTP pounder test executed in parallel with
the LTP hotplug test. There are zillions of userspace processes sitting in
the runqueue while the number of online CPUs gets down to 1. CPUs are
onlined back in the second stage, which is where the issue triggers.

When we online a CPU and create the watchdog kernel thread it will take
some time until it gets to a CPU. On the other hand the perf counter
callback is executed in a timely fashion, so we explode the first time
it finds out that hrtimer_interrupts wasn't incremented.

Let's fix this by boosting the watchdog thread priority before we wake it up
rather than when it's already running.
This still doesn't handle a case where we have the same amount of high prio
FIFO tasks but that doesn't seem to be common. The current implementation
doesn't handle that case anyway so this is no worse at least.

Unfortunately, we cannot start the perf counter from the watchdog thread
because we could miss a real lockup, and we also cannot start the
hrtimer from watchdog_enable because there is no way (at least I don't
know of any) to start a hrtimer from a different CPU.
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

Mandeep Singh Baines

Mar 15, 2012, 1:10:03 PM
Peter Zijlstra (pet...@infradead.org) wrote:
> On Thu, 2012-03-15 at 17:11 +0100, Peter Zijlstra wrote:
> > On Thu, 2012-03-15 at 17:10 +0100, Peter Zijlstra wrote:
> > > On Thu, 2012-03-15 at 08:39 -0700, Mandeep Singh Baines wrote:
> > > > Its a good tool for catching problems of scale. As we move to more and
> > > > more cores you'll uncover bugs where data structures start to blow up.
> > > > Hash tables get huge, when you have 100000s of processes or millions
> > > > of
> > > > TCP flows, or cgroups or namespace. That critical section (spinlock,
> > > > spinlock_bh, or preempt_disable) that used to be OK might no longer
> > > > be.
> > >
> > > Or you run with the preempt latency tracer.
> >
> > Or for that matter run cyclictest...
>
> Thing is, if you want a latency detector, call it that and stop
> pretending its a useful debug feature. Also, if you want that, set the
> interval in the 0.1-0.5 seconds range and dump stack on every new max.
>
>

But the preempt latency tracer's overhead is not negligible, while the
softlockup detector's is. Softlockup is a great tool for detecting
temporary long-duration lockups that can occur when data structures blow
up. Because of the overhead, you probably wouldn't enable preempt latency
tracing in production. If the problem is happening often enough, you
might temporarily turn it on for a few machines to get a stack trace. But
you might not have the luxury of being able to do that.

I can't predict what my users are going to do. They will do things I never
expected. So I can't test these cases in the lab, which rules out the
preempt latency tracer. With softlockup, I can find out about problems I
never even knew I had.

In addition, softlockup is also a great tool for finding permanent lockups.

One idea for reducing the preempt latency tracer's overhead would be to use
the same approach that the HW counters use. Instead of examining every
preempt enable/disable, only examine 1 in 1000 (some configurable number).
That way you could turn it on in production. Maybe a simple per_cpu
counter.
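A rough sketch of what such sampling might look like (hypothetical code,
not something posted in this thread; the helper names and the 20ms
threshold are made up for illustration):

#define LATENCY_SAMPLE_RATE     1000

static DEFINE_PER_CPU(unsigned int, latency_sample_count);
static DEFINE_PER_CPU(u64, latency_sample_start);

static inline void sample_preempt_disable(void)
{
        /* only time 1 out of every LATENCY_SAMPLE_RATE sections */
        if (this_cpu_inc_return(latency_sample_count) % LATENCY_SAMPLE_RATE)
                return;
        this_cpu_write(latency_sample_start, local_clock());
}

static inline void sample_preempt_enable(void)
{
        u64 start = this_cpu_read(latency_sample_start);

        if (start && local_clock() - start > 20 * NSEC_PER_MSEC)
                dump_stack();   /* unusually long non-preemptible section */
        this_cpu_write(latency_sample_start, 0);
}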

Regards,
Mandeep

Don Zickus

Mar 15, 2012, 1:20:02 PM
On Thu, Mar 15, 2012 at 05:14:22PM +0100, Michal Hocko wrote:
> On Thu 15-03-12 11:54:13, Don Zickus wrote:
> > On Thu, Mar 15, 2012 at 09:02:32AM +0100, Michal Hocko wrote:
> > > On Wed 14-03-12 16:19:06, Andrew Morton wrote:
> > > > On Wed, 14 Mar 2012 16:38:45 -0400
> > > > Don Zickus <dzi...@redhat.com> wrote:
> > > >
> > > > > From: Michal Hocko <mho...@suse.cz>
> > > >
> > > > This changelog is awful.
> >
> > My apologies too, Andrew for not being more diligent.
> >
> > Some nitpicks below (hopefully it isn't too picky :-( )
>
> Thanks! Updated

I think it looks fine. Is this ok now Andrew? I can respin this.

Cheers,
Don

Andrew Morton

Mar 19, 2012, 6:10:02 PM
On Thu, 15 Mar 2012 17:04:31 +0100
Peter Zijlstra <a.p.zi...@chello.nl> wrote:

> On Thu, 2012-03-15 at 11:54 -0400, Don Zickus wrote:
> > > > Why did watchdog() reset the scheduling policy seven instructions
> > > > before exiting? Seems pointless.
> > >
> > > It has been introduced by Thomas in cba9bd22. To be honest I don't
> > > understand why it makes a sense?
> >
> > Yeah I noticed that too. I didn't bother questioning it either when
> > it
> > went in. I just assumed Thomas and Peter know scheduling a lot better
> > than I do. :-)
>
> I just dug through my IRC logs and the reason is that running the entire
> exit path as a highest priority RT task incurs a latency spike to other
> tasks running on the system (313us). Since there's absolutely no point
> in running the exit path as a RT task, tglx made it not do that.

IRC logs, huh?


--- a/kernel/watchdog.c~a
+++ a/kernel/watchdog.c
@@ -349,6 +349,10 @@ static int watchdog(void *unused)
 
                 set_current_state(TASK_INTERRUPTIBLE);
         }
+        /*
+         * Drop the policy/priority elevation during thread exit to avoid a
+         * scheduling latency spike.
+         */
         __set_current_state(TASK_RUNNING);
         sched_setscheduler(current, SCHED_NORMAL, &param);
         return 0;
_