[PATCH 1/2] watchdog, nmi: Allow hardlockup to panic by default

Don Zickus

Mar 3, 2011, 3:40:02 PM
Add a Kconfig option to allow users to set the hardlockup to panic
by default. Also add in a 'nmi_watchdog=nopanic' to override this.

Signed-off-by: Don Zickus <dzi...@redhat.com>

---
Forgot to cc lkml, sorry for the spam

---
Documentation/kernel-parameters.txt | 5 +++--
kernel/watchdog.c | 5 ++++-
lib/Kconfig.debug | 17 +++++++++++++++++
3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 89835a4..ae0b499 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1577,11 +1577,12 @@ and is between 256 and 4096 characters. It is defined in the file
Format: [state][,regs][,debounce][,die]

nmi_watchdog= [KNL,BUGS=X86] Debugging features for SMP kernels
- Format: [panic,][num]
+ Format: [panic,][nopanic,][num]
Valid num: 0
0 - turn nmi_watchdog off
When panic is specified, panic when an NMI watchdog
- timeout occurs.
+ timeout occurs (or 'nopanic' to override the opposite
+ default).
This is useful when you use a panic=... timeout and
need the box quickly up again.

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 18bb157..f7c0272 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -48,12 +48,15 @@ static DEFINE_PER_CPU(struct perf_event *, watchdog_ev);
* Should we panic when a soft-lockup or hard-lockup occurs:
*/
#ifdef CONFIG_HARDLOCKUP_DETECTOR
-static int hardlockup_panic;
+static int hardlockup_panic =
+ CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE;

static int __init hardlockup_panic_setup(char *str)
{
if (!strncmp(str, "panic", 5))
hardlockup_panic = 1;
+ else if (!strncmp(str, "nopanic", 5))
+ hardlockup_panic = 0;
else if (!strncmp(str, "0", 1))
watchdog_enabled = 0;
return 1;
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 2b97418..80bd292 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -176,6 +176,23 @@ config HARDLOCKUP_DETECTOR
def_bool LOCKUP_DETECTOR && PERF_EVENTS && HAVE_PERF_EVENTS_NMI && \
!ARCH_HAS_NMI_WATCHDOG

+config BOOTPARAM_HARDLOCKUP_PANIC
+ bool "Panic (Reboot) On Soft Lockups"
+ depends on LOCKUP_DETECTOR
+ help
+ Say Y here to enable the kernel to panic on "hard lockups",
+ which are bugs that cause the kernel to loop in kernel
+ mode with interrupts disabled for more than 60 seconds.
+
+ Say N if unsure.
+
+config BOOTPARAM_HARDLOCKUP_PANIC_VALUE
+ int
+ depends on LOCKUP_DETECTOR
+ range 0 1
+ default 0 if !BOOTPARAM_HARDLOCKUP_PANIC
+ default 1 if BOOTPARAM_HARDLOCKUP_PANIC
+
config BOOTPARAM_SOFTLOCKUP_PANIC
bool "Panic (Reboot) On Soft Lockups"
depends on LOCKUP_DETECTOR
--
1.7.3.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Don Zickus

Mar 3, 2011, 3:40:02 PM
This patch addresses a couple of problems. One was the case when the
hardlockup failed to start, it also failed to start the softlockup.
There were valid cases when the hardlockup shouldn't start and that
shouldn't block the softlockup (no lapic, bios controls perf counters).

The second problem was when the hardlockup failed to start on boxes
(from a no lapic or bios controlled perf counter case), it reported
failure to the cpu notifier chain. This blocked the notifier from
continuing to start other more critical pieces of cpu bring-up (in
our case based on a 2.6.32 fork, it was the mce). As a result,
during soft cpu online/offline testing, the system would panic
when a cpu was offlined because the cpu notifier would succeed in
processing a watchdog disable cpu event and would panic in the mce
case as a result of un-initialized variables from a never executed
cpu up event.

I realized the hardlockup/softlockup cases are really just debugging
aids and should never impede the progress of a cpu up/down event.
Therefore I modified the code to always return NOTIFY_OK and instead
rely on printks to inform the user of problems.

Signed-off-by: Don Zickus <dzi...@redhat.com>
---
Forgot to cc lkml, sorry for the spam

---
kernel/watchdog.c | 22 ++++++++++++++++------
1 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index f7c0272..c52645b 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -418,19 +418,22 @@ static int watchdog_prepare_cpu(int cpu)
static int watchdog_enable(int cpu)
{
struct task_struct *p = per_cpu(softlockup_watchdog, cpu);
- int err;
+ int err = 0;

/* enable the perf event */
err = watchdog_nmi_enable(cpu);
- if (err)
- return err;
+
+ /* Regardless of err above, fall through and start softlockup */

/* create the watchdog thread */
if (!p) {
p = kthread_create(watchdog, (void *)(unsigned long)cpu, "watchdog/%d", cpu);
if (IS_ERR(p)) {
printk(KERN_ERR "softlockup watchdog for %i failed\n", cpu);
- return PTR_ERR(p);
+ if (!err)
+ /* if hardlockup hasn't already set this */
+ err = PTR_ERR(p);
+ goto out;
}
kthread_bind(p, cpu);
per_cpu(watchdog_touch_ts, cpu) = 0;
@@ -438,7 +441,8 @@ static int watchdog_enable(int cpu)
wake_up_process(p);
}

- return 0;
+out:
+ return err;
}

static void watchdog_disable(int cpu)
@@ -550,7 +554,13 @@ cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
break;
#endif /* CONFIG_HOTPLUG_CPU */
}
- return notifier_from_errno(err);
+
+ /*
+ * hardlockup and softlockup are not important enough
+ * to block cpu bring up. Just always succeed and
+ * rely on printk output to flag problems.
+ */
+ return NOTIFY_OK;
}

static struct notifier_block __cpuinitdata cpu_nfb = {

Jack Stone

Mar 4, 2011, 6:20:02 PM
On 03/03/2011 20:33, Don Zickus wrote:
> Add a Kconfig option to allow users to set the hardlockup to panic
> by default. Also add in a 'nmi_watchdog=nopanic' to override this.
>
> Signed-off-by: Don Zickus <dzi...@redhat.com>
>
[snip]

> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 2b97418..80bd292 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -176,6 +176,23 @@ config HARDLOCKUP_DETECTOR
> def_bool LOCKUP_DETECTOR && PERF_EVENTS && HAVE_PERF_EVENTS_NMI && \
> !ARCH_HAS_NMI_WATCHDOG
>
> +config BOOTPARAM_HARDLOCKUP_PANIC
> + bool "Panic (Reboot) On Soft Lockups"

Did you mean Hard Lockups here?

Thanks,

Jack

Don Zickus

Mar 7, 2011, 4:40:01 PM
This patch addresses a couple of problems. One was the case when the
hardlockup failed to start, it also failed to start the softlockup.
There were valid cases when the hardlockup shouldn't start and that
shouldn't block the softlockup (no lapic, bios controls perf counters).

The second problem was when the hardlockup failed to start on boxes
(from a no lapic or bios controlled perf counter case), it reported
failure to the cpu notifier chain. This blocked the notifier from
continuing to start other more critical pieces of cpu bring-up (in
our case based on a 2.6.32 fork, it was the mce). As a result,
during soft cpu online/offline testing, the system would panic
when a cpu was offlined because the cpu notifier would succeed in
processing a watchdog disable cpu event and would panic in the mce
case as a result of un-initialized variables from a never executed
cpu up event.

I realized the hardlockup/softlockup cases are really just debugging
aids and should never impede the progress of a cpu up/down event.
Therefore I modified the code to always return NOTIFY_OK and instead
rely on printks to inform the user of problems.

Signed-off-by: Don Zickus <dzi...@redhat.com>
---


Don Zickus

Mar 7, 2011, 4:40:02 PM
Add a Kconfig option to allow users to set the hardlockup to panic
by default. Also add in a 'nmi_watchdog=nopanic' to override this.

v2:
clean up a typo

Signed-off-by: Don Zickus <dzi...@redhat.com>
---

Documentation/kernel-parameters.txt | 5 +++--
kernel/watchdog.c | 5 ++++-
lib/Kconfig.debug | 17 +++++++++++++++++
3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 89835a4..ae0b499 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1577,11 +1577,12 @@ and is between 256 and 4096 characters. It is defined in the file
Format: [state][,regs][,debounce][,die]

nmi_watchdog= [KNL,BUGS=X86] Debugging features for SMP kernels
- Format: [panic,][num]
+ Format: [panic,][nopanic,][num]
Valid num: 0
0 - turn nmi_watchdog off
When panic is specified, panic when an NMI watchdog
- timeout occurs.
+ timeout occurs (or 'nopanic' to override the opposite
+ default).
This is useful when you use a panic=... timeout and
need the box quickly up again.

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 18bb157..f7c0272 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -48,12 +48,15 @@ static DEFINE_PER_CPU(struct perf_event *, watchdog_ev);
* Should we panic when a soft-lockup or hard-lockup occurs:
*/
#ifdef CONFIG_HARDLOCKUP_DETECTOR
-static int hardlockup_panic;
+static int hardlockup_panic =
+ CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE;

static int __init hardlockup_panic_setup(char *str)
{
if (!strncmp(str, "panic", 5))
hardlockup_panic = 1;
+ else if (!strncmp(str, "nopanic", 5))
+ hardlockup_panic = 0;
else if (!strncmp(str, "0", 1))
watchdog_enabled = 0;
return 1;
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 2b97418..8769533 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -176,6 +176,23 @@ config HARDLOCKUP_DETECTOR
def_bool LOCKUP_DETECTOR && PERF_EVENTS && HAVE_PERF_EVENTS_NMI && \
!ARCH_HAS_NMI_WATCHDOG

+config BOOTPARAM_HARDLOCKUP_PANIC
+ bool "Panic (Reboot) On Hard Lockups"
+ depends on LOCKUP_DETECTOR
+ help
+ Say Y here to enable the kernel to panic on "hard lockups",
+ which are bugs that cause the kernel to loop in kernel
+ mode with interrupts disabled for more than 60 seconds.
+
+ Say N if unsure.
+
+config BOOTPARAM_HARDLOCKUP_PANIC_VALUE
+ int
+ depends on LOCKUP_DETECTOR
+ range 0 1
+ default 0 if !BOOTPARAM_HARDLOCKUP_PANIC
+ default 1 if BOOTPARAM_HARDLOCKUP_PANIC
+
config BOOTPARAM_SOFTLOCKUP_PANIC
bool "Panic (Reboot) On Soft Lockups"
depends on LOCKUP_DETECTOR

Don Zickus

Mar 7, 2011, 4:40:02 PM
On Fri, Mar 04, 2011 at 11:15:21PM +0000, Jack Stone wrote:
> On 03/03/2011 20:33, Don Zickus wrote:
> > Add a Kconfig option to allow users to set the hardlockup to panic
> > by default. Also add in a 'nmi_watchdog=nopanic' to override this.
> >
> > Signed-off-by: Don Zickus <dzi...@redhat.com>
> >
> [snip]
> > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> > index 2b97418..80bd292 100644
> > --- a/lib/Kconfig.debug
> > +++ b/lib/Kconfig.debug
> > @@ -176,6 +176,23 @@ config HARDLOCKUP_DETECTOR
> > def_bool LOCKUP_DETECTOR && PERF_EVENTS && HAVE_PERF_EVENTS_NMI && \
> > !ARCH_HAS_NMI_WATCHDOG
> >
> > +config BOOTPARAM_HARDLOCKUP_PANIC
> > + bool "Panic (Reboot) On Soft Lockups"
>
> Did you mean Hard Lockups here?

Yes. Thanks! :-)

Cheers,
Don

Peter Zijlstra

Mar 17, 2011, 5:20:02 AM
On Mon, 2011-03-07 at 16:37 -0500, Don Zickus wrote:
>
> This patch addresses a couple of problems. One was the case when the
> hardlockup failed to start, it also failed to start the softlockup.
> There were valid cases when the hardlockup shouldn't start and that
> shouldn't block the softlockup (no lapic, bios controls perf
> counters).
>
> The second problem was when the hardlockup failed to start on boxes
> (from a no lapic or bios controlled perf counter case), it reported
> failure to the cpu notifier chain. This blocked the notifier from
> continuing to start other more critical pieces of cpu bring-up (in
> our case based on a 2.6.32 fork, it was the mce). As a result,
> during soft cpu online/offline testing, the system would panic
> when a cpu was offlined because the cpu notifier would succeed in
> processing a watchdog disable cpu event and would panic in the mce
> case as a result of un-initialized variables from a never executed
> cpu up event.
>
> I realized the hardlockup/softlockup cases are really just debugging
> aids and should never impede the progress of a cpu up/down event.
> Therefore I modified the code to always return NOTIFY_OK and instead
> rely on printks to inform the user of problems.
>
> Signed-off-by: Don Zickus <dzi...@redhat.com>

Acked-by: Peter Zijlstra <a.p.zi...@chello.nl>

Peter Zijlstra

Mar 17, 2011, 5:20:02 AM
On Mon, 2011-03-07 at 16:37 -0500, Don Zickus wrote:
> Add a Kconfig option to allow users to set the hardlockup to panic
> by default. Also add in a 'nmi_watchdog=nopanic' to override this.
>
> v2:
> clean up a typo
>
> Signed-off-by: Don Zickus <dzi...@redhat.com>

Acked-by: Peter Zijlstra <a.p.zi...@chello.nl>

WANG Cong

Mar 17, 2011, 8:10:01 AM
On Mon, 07 Mar 2011 16:37:39 -0500, Don Zickus wrote:

> Add a Kconfig option to allow users to set the hardlockup to panic by
> default. Also add in a 'nmi_watchdog=nopanic' to override this.
>
> v2:
> clean up a typo
>
> Signed-off-by: Don Zickus <dzi...@redhat.com>

Looks good, I believe this is good for kdump.

Reviewed-by: WANG Cong <xiyou.w...@gmail.com>

Thanks.

WANG Cong

Mar 17, 2011, 8:20:02 AM
On Mon, 07 Mar 2011 16:37:40 -0500, Don Zickus wrote:

> This patch addresses a couple of problems. One was the case when the
> hardlockup failed to start, it also failed to start the softlockup.
> There were valid cases when the hardlockup shouldn't start and that
> shouldn't block the softlockup (no lapic, bios controls perf counters).
>
> The second problem was when the hardlockup failed to start on boxes
> (from a no lapic or bios controlled perf counter case), it reported
> failure to the cpu notifier chain. This blocked the notifier from
> continuing to start other more critical pieces of cpu bring-up (in our
> case based on a 2.6.32 fork, it was the mce). As a result, during soft
> cpu online/offline testing, the system would panic when a cpu was
> offlined because the cpu notifier would succeed in processing a watchdog
> disable cpu event and would panic in the mce case as a result of
> un-initialized variables from a never executed cpu up event.

What I saw was with microcode: its /sys entries failed to come up, and
this triggered a warning when those entries were removed as the CPU went
offline again.

>
> I realized the hardlockup/softlockup cases are really just debugging
> aids and should never impede the progress of a cpu up/down event.
> Therefore I modified the code to always return NOTIFY_OK and instead
> rely on printks to inform the user of problems.
>

Yeah, it should also fix the problem I saw.

Reviewed-by: WANG Cong <xiyou.w...@gmail.com>

Thanks.


Andrew Morton

Mar 17, 2011, 10:00:02 PM
On Mon, 7 Mar 2011 16:37:39 -0500 Don Zickus <dzi...@redhat.com> wrote:

> Add a Kconfig option to allow users to set the hardlockup to panic
> by default. Also add in a 'nmi_watchdog=nopanic' to override this.
>

Changelog forgot to tell us "why".

> Format: [state][,regs][,debounce][,die]
>
> nmi_watchdog= [KNL,BUGS=X86] Debugging features for SMP kernels
> - Format: [panic,][num]
> + Format: [panic,][nopanic,][num]

It would be better to support panic=[0|1], if that can be simply done
in a back-compatible fashion.

> static int __init hardlockup_panic_setup(char *str)
> {
> if (!strncmp(str, "panic", 5))
> hardlockup_panic = 1;
> + else if (!strncmp(str, "nopanic", 5))

s/5/7/

> + hardlockup_panic = 0;
> else if (!strncmp(str, "0", 1))
> watchdog_enabled = 0;
> return 1;


Don Zickus

Mar 18, 2011, 1:30:03 PM
On Thu, Mar 17, 2011 at 06:50:13PM -0700, Andrew Morton wrote:
> On Mon, 7 Mar 2011 16:37:39 -0500 Don Zickus <dzi...@redhat.com> wrote:
>
> > Add a Kconfig option to allow users to set the hardlockup to panic
> > by default. Also add in a 'nmi_watchdog=nopanic' to override this.
> >
>
> Changelog forgot to tell us "why".

Yeah, sorry about that.

When a cpu is considered stuck, instead of limping along and just printing
a warning, it is sometimes preferred to just panic, let kdump capture the
vmcore and reboot. This gets the machine back into a stable state quickly
while saving the info that got it into a stuck state to begin with.


>
> > Format: [state][,regs][,debounce][,die]
> >
> > nmi_watchdog= [KNL,BUGS=X86] Debugging features for SMP kernels
> > - Format: [panic,][num]
> > + Format: [panic,][nopanic,][num]
>
> It would be better to support panic=[0|1], if that can be simply done
> in a back-compatible fashion.

I am open to the idea, just can't figure the best way to implement that in
a backwards compatible way.

Personally I was wondering if there were situations where you would _not_
want it to panic. If the cpu is stuck spinning after 60 seconds, the odds
of it freeing itself are low and you are probably stuck rebooting anyway.


>
> > static int __init hardlockup_panic_setup(char *str)
> > {
> > if (!strncmp(str, "panic", 5))
> > hardlockup_panic = 1;
> > + else if (!strncmp(str, "nopanic", 5))
>
> s/5/7/

doh.

I can send a refreshed patch with the above changes.

Cheers,
Don

Andrew Morton

Mar 18, 2011, 2:30:01 PM
On Fri, 18 Mar 2011 13:19:32 -0400
Don Zickus <dzi...@redhat.com> wrote:

> On Thu, Mar 17, 2011 at 06:50:13PM -0700, Andrew Morton wrote:
> > On Mon, 7 Mar 2011 16:37:39 -0500 Don Zickus <dzi...@redhat.com> wrote:
> >
> > > Add a Kconfig option to allow users to set the hardlockup to panic
> > > by default. Also add in a 'nmi_watchdog=nopanic' to override this.
> > >
> >
> > Changelog forgot to tell us "why".
>
> Yeah, sorry about that.
>
> When a cpu is considered stuck, instead of limping along and just printing
> a warning, it is sometimes preferred to just panic, let kdump capture the
> vmcore and reboot. This gets the machine back into a stable state quickly
> while saving the info that got it into a stuck state to begin with.

Ah, makes sense, thanks. I updated the changelog.

>
> >
> > > Format: [state][,regs][,debounce][,die]
> > >
> > > nmi_watchdog= [KNL,BUGS=X86] Debugging features for SMP kernels
> > > - Format: [panic,][num]
> > > + Format: [panic,][nopanic,][num]
> >
> > It would be better to support panic=[0|1], if that can be simply done
> > in a back-compatible fashion.
>
> I am open to the idea, just can't figure the best way to implement that in
> a backwards compatible way.

It's not worth busting a gut over ;)

> Personally I was wondering if there were situations where you would _not_
> want it to panic. If the cpu is stuck spinning after 60 seconds, the odds
> of it freeing itself is low and you are probably stuck rebooting anyway.
>
>
> >
> > > static int __init hardlockup_panic_setup(char *str)
> > > {
> > > if (!strncmp(str, "panic", 5))
> > > hardlockup_panic = 1;
> > > + else if (!strncmp(str, "nopanic", 5))
> >
> > s/5/7/
>
> doh.
>
> I can send a refreshed patch with the above changes.

I fixed that up.

Don Zickus

Mar 18, 2011, 3:00:02 PM
On Fri, Mar 18, 2011 at 11:23:53AM -0700, Andrew Morton wrote:
> >
> >
> > >
> > > > static int __init hardlockup_panic_setup(char *str)
> > > > {
> > > > if (!strncmp(str, "panic", 5))
> > > > hardlockup_panic = 1;
> > > > + else if (!strncmp(str, "nopanic", 5))
> > >
> > > s/5/7/
> >
> > doh.
> >
> > I can send a refreshed patch with the above changes.
>
> I fixed that up.

Thanks!

Cheers,
Don
