Add a Kconfig option to allow users to set the hardlockup to panic by
default. Also add in a 'nmi_watchdog=nopanic' to override this.

Signed-off-by: Don Zickus <dzi...@redhat.com>
---
Forgot to cc lkml, sorry for the spam
---
Documentation/kernel-parameters.txt | 5 +++--
kernel/watchdog.c | 5 ++++-
lib/Kconfig.debug | 17 +++++++++++++++++
3 files changed, 24 insertions(+), 3 deletions(-)
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 89835a4..ae0b499 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1577,11 +1577,12 @@ and is between 256 and 4096 characters. It is defined in the file
Format: [state][,regs][,debounce][,die]
nmi_watchdog= [KNL,BUGS=X86] Debugging features for SMP kernels
- Format: [panic,][num]
+ Format: [panic,][nopanic,][num]
Valid num: 0
0 - turn nmi_watchdog off
When panic is specified, panic when an NMI watchdog
- timeout occurs.
+ timeout occurs (or 'nopanic' to override the opposite
+ default).
This is useful when you use a panic=... timeout and
need the box quickly up again.
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 18bb157..f7c0272 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -48,12 +48,15 @@ static DEFINE_PER_CPU(struct perf_event *, watchdog_ev);
* Should we panic when a soft-lockup or hard-lockup occurs:
*/
#ifdef CONFIG_HARDLOCKUP_DETECTOR
-static int hardlockup_panic;
+static int hardlockup_panic =
+ CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE;
static int __init hardlockup_panic_setup(char *str)
{
if (!strncmp(str, "panic", 5))
hardlockup_panic = 1;
+ else if (!strncmp(str, "nopanic", 5))
+ hardlockup_panic = 0;
else if (!strncmp(str, "0", 1))
watchdog_enabled = 0;
return 1;
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 2b97418..80bd292 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -176,6 +176,23 @@ config HARDLOCKUP_DETECTOR
def_bool LOCKUP_DETECTOR && PERF_EVENTS && HAVE_PERF_EVENTS_NMI && \
!ARCH_HAS_NMI_WATCHDOG
+config BOOTPARAM_HARDLOCKUP_PANIC
+ bool "Panic (Reboot) On Soft Lockups"
+ depends on LOCKUP_DETECTOR
+ help
+ Say Y here to enable the kernel to panic on "hard lockups",
+ which are bugs that cause the kernel to loop in kernel
+ mode with interrupts disabled for more than 60 seconds.
+
+ Say N if unsure.
+
+config BOOTPARAM_HARDLOCKUP_PANIC_VALUE
+ int
+ depends on LOCKUP_DETECTOR
+ range 0 1
+ default 0 if !BOOTPARAM_HARDLOCKUP_PANIC
+ default 1 if BOOTPARAM_HARDLOCKUP_PANIC
+
config BOOTPARAM_SOFTLOCKUP_PANIC
bool "Panic (Reboot) On Soft Lockups"
depends on LOCKUP_DETECTOR
--
1.7.3.5
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
This patch addresses a couple of problems. One was the case where the
hardlockup detector failed to start and, as a result, the softlockup
detector also failed to start. There are valid cases where the hardlockup
detector shouldn't start, and that shouldn't block the softlockup detector
(no lapic, BIOS controls the perf counters).

The second problem was that when the hardlockup detector failed to start
on such boxes (the no-lapic or BIOS-controlled perf counter case), it
reported the failure to the cpu notifier chain. This blocked the notifier
from continuing to start other, more critical pieces of cpu bring-up (in
our case, based on a 2.6.32 fork, it was the mce code). As a result,
during soft cpu online/offline testing, the system would panic when a cpu
was offlined: the cpu notifier would succeed in processing a
watchdog-disable cpu event and then panic in the mce code because of
uninitialized variables from a cpu-up event that never ran.
I realized the hardlockup/softlockup cases are really just debugging
aids and should never impede the progress of a cpu up/down event.
Therefore I modified the code to always return NOTIFY_OK and instead
rely on printks to inform the user of problems.
Signed-off-by: Don Zickus <dzi...@redhat.com>
---
Forgot to cc lkml, sorry for the spam
---
kernel/watchdog.c | 22 ++++++++++++++++------
1 files changed, 16 insertions(+), 6 deletions(-)
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index f7c0272..c52645b 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -418,19 +418,22 @@ static int watchdog_prepare_cpu(int cpu)
static int watchdog_enable(int cpu)
{
struct task_struct *p = per_cpu(softlockup_watchdog, cpu);
- int err;
+ int err = 0;
/* enable the perf event */
err = watchdog_nmi_enable(cpu);
- if (err)
- return err;
+
+ /* Regardless of err above, fall through and start softlockup */
/* create the watchdog thread */
if (!p) {
p = kthread_create(watchdog, (void *)(unsigned long)cpu, "watchdog/%d", cpu);
if (IS_ERR(p)) {
printk(KERN_ERR "softlockup watchdog for %i failed\n", cpu);
- return PTR_ERR(p);
+ if (!err)
+ /* if hardlockup hasn't already set this */
+ err = PTR_ERR(p);
+ goto out;
}
kthread_bind(p, cpu);
per_cpu(watchdog_touch_ts, cpu) = 0;
@@ -438,7 +441,8 @@ static int watchdog_enable(int cpu)
wake_up_process(p);
}
- return 0;
+out:
+ return err;
}
static void watchdog_disable(int cpu)
@@ -550,7 +554,13 @@ cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
break;
#endif /* CONFIG_HOTPLUG_CPU */
}
- return notifier_from_errno(err);
+
+ /*
+ * hardlockup and softlockup are not important enough
+ * to block cpu bring up. Just always succeed and
+ * rely on printk output to flag problems.
+ */
+ return NOTIFY_OK;
}
static struct notifier_block __cpuinitdata cpu_nfb = {
Did you mean Hard Lockups here?
Thanks,
Jack
Add a Kconfig option to allow users to set the hardlockup to panic by
default. Also add in a 'nmi_watchdog=nopanic' to override this.

v2:
clean up a typo
Signed-off-by: Don Zickus <dzi...@redhat.com>
---
Documentation/kernel-parameters.txt | 5 +++--
kernel/watchdog.c | 5 ++++-
lib/Kconfig.debug | 17 +++++++++++++++++
3 files changed, 24 insertions(+), 3 deletions(-)
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 89835a4..ae0b499 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1577,11 +1577,12 @@ and is between 256 and 4096 characters. It is defined in the file
Format: [state][,regs][,debounce][,die]
nmi_watchdog= [KNL,BUGS=X86] Debugging features for SMP kernels
- Format: [panic,][num]
+ Format: [panic,][nopanic,][num]
Valid num: 0
0 - turn nmi_watchdog off
When panic is specified, panic when an NMI watchdog
- timeout occurs.
+ timeout occurs (or 'nopanic' to override the opposite
+ default).
This is useful when you use a panic=... timeout and
need the box quickly up again.
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 18bb157..f7c0272 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -48,12 +48,15 @@ static DEFINE_PER_CPU(struct perf_event *, watchdog_ev);
* Should we panic when a soft-lockup or hard-lockup occurs:
*/
#ifdef CONFIG_HARDLOCKUP_DETECTOR
-static int hardlockup_panic;
+static int hardlockup_panic =
+ CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE;
static int __init hardlockup_panic_setup(char *str)
{
if (!strncmp(str, "panic", 5))
hardlockup_panic = 1;
+ else if (!strncmp(str, "nopanic", 5))
+ hardlockup_panic = 0;
else if (!strncmp(str, "0", 1))
watchdog_enabled = 0;
return 1;
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 2b97418..8769533 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -176,6 +176,23 @@ config HARDLOCKUP_DETECTOR
def_bool LOCKUP_DETECTOR && PERF_EVENTS && HAVE_PERF_EVENTS_NMI && \
!ARCH_HAS_NMI_WATCHDOG
+config BOOTPARAM_HARDLOCKUP_PANIC
+ bool "Panic (Reboot) On Hard Lockups"
+ depends on LOCKUP_DETECTOR
+ help
+ Say Y here to enable the kernel to panic on "hard lockups",
+ which are bugs that cause the kernel to loop in kernel
+ mode with interrupts disabled for more than 60 seconds.
+
+ Say N if unsure.
+
+config BOOTPARAM_HARDLOCKUP_PANIC_VALUE
+ int
+ depends on LOCKUP_DETECTOR
+ range 0 1
+ default 0 if !BOOTPARAM_HARDLOCKUP_PANIC
+ default 1 if BOOTPARAM_HARDLOCKUP_PANIC
+
config BOOTPARAM_SOFTLOCKUP_PANIC
bool "Panic (Reboot) On Soft Lockups"
depends on LOCKUP_DETECTOR
Yes. Thanks! :-)
Cheers,
Don
Acked-by: Peter Zijlstra <a.p.zi...@chello.nl>
Acked-by: Peter Zijlstra <a.p.zi...@chello.nl>
> Add a Kconfig option to allow users to set the hardlockup to panic by
> default. Also add in a 'nmi_watchdog=nopanic' to override this.
>
> v2:
> clean up a typo
>
> Signed-off-by: Don Zickus <dzi...@redhat.com>
Looks good, I believe this is good for kdump.
Reviewed-by: WANG Cong <xiyou.w...@gmail.com>
Thanks.
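(Editor's illustration, not part of any posted patch: with the option above
enabled, the build-time default and the boot-time override would combine
roughly as follows; the .config lines are only an assumed fragment.)

    # .config fragment -- panic on hard lockups by default
    CONFIG_LOCKUP_DETECTOR=y
    CONFIG_BOOTPARAM_HARDLOCKUP_PANIC=y
    CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=1

    # per-boot override on a box where panicking is not wanted
    nmi_watchdog=nopanic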
> This patch addresses a couple of problems. One was the case when the
> hardlockup failed to start, it also failed to start the softlockup.
> There were valid cases when the hardlockup shouldn't start and that
> shouldn't block the softlockup (no lapic, bios controls perf counters).
>
> The second problem was when the hardlockup failed to start on boxes
> (from a no lapic or bios controlled perf counter case), it reported
> failure to the cpu notifier chain. This blocked the notifier from
> continuing to start other more critical pieces of cpu bring-up (in our
> case based on a 2.6.32 fork, it was the mce). As a result, during soft
> cpu online/offline testing, the system would panic when a cpu was
> offlined because the cpu notifier would succeed in processing a watchdog
> disable cpu event and would panic in the mce case as a result of
> un-initialized variables from a never executed cpu up event.
What I saw was with microcode: its /sys entries failed to come up, and
this triggered a warning when those entries were removed as the CPU went
offline again.
>
> I realized the hardlockup/softlockup cases are really just debugging
> aids and should never impede the progress of a cpu up/down event.
> Therefore I modified the code to always return NOTIFY_OK and instead
> rely on printks to inform the user of problems.
>
Yeah, it should also fix the problem I saw.
Reviewed-by: WANG Cong <xiyou.w...@gmail.com>
Thanks.
--
> Add a Kconfig option to allow users to set the hardlockup to panic
> by default. Also add in a 'nmi_watchdog=nopanic' to override this.
>
Changelog forgot to tell us "why".
> Format: [state][,regs][,debounce][,die]
>
> nmi_watchdog= [KNL,BUGS=X86] Debugging features for SMP kernels
> - Format: [panic,][num]
> + Format: [panic,][nopanic,][num]
It would be better to support panic=[0|1], if that can be simply done
in a back-compatible fashion.
> static int __init hardlockup_panic_setup(char *str)
> {
> if (!strncmp(str, "panic", 5))
> hardlockup_panic = 1;
> + else if (!strncmp(str, "nopanic", 5))
s/5/7/
> + hardlockup_panic = 0;
> else if (!strncmp(str, "0", 1))
> watchdog_enabled = 0;
> return 1;
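(For reference, the fix the s/5/7/ remark asks for would presumably end up
as the following -- with a length of 5 the test only checks the "nopan"
prefix:)

    else if (!strncmp(str, "nopanic", 7))
            hardlockup_panic = 0;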
--
Yeah, sorry about that.
When a cpu is considered stuck, instead of limping along and just printing
a warning, it is sometimes preferred to just panic, let kdump capture the
vmcore and reboot. This gets the machine back into a stable state quickly
while saving the info that got it into a stuck state to begin with.
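As a rough illustration of that setup (the values below are made up;
nmi_watchdog=panic is from this patch, while panic= and crashkernel= are
the existing parameters it is meant to be combined with):

    nmi_watchdog=panic panic=30 crashkernel=128M

A hard lockup then panics the box, kdump captures the vmcore from the
crashkernel reservation, and panic=30 reboots it after 30 seconds if no
capture kernel is configured.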
>
> > Format: [state][,regs][,debounce][,die]
> >
> > nmi_watchdog= [KNL,BUGS=X86] Debugging features for SMP kernels
> > - Format: [panic,][num]
> > + Format: [panic,][nopanic,][num]
>
> It would be better to support panic=[0|1], if that can be simply done
> in a back-compatible fashion.
I am open to the idea, just can't figure the best way to implement that in
a backwards compatible way.
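(Sketch only, not something posted in this thread: one way the existing
strings could coexist with a panic=<0|1> form is to test the longer
"panic=" prefix first; simple_strtoul() is assumed for the digit.)

    static int __init hardlockup_panic_setup(char *str)
    {
            /*
             * Sketch: accept the existing "panic", "nopanic" and "0"
             * forms plus a hypothetical "panic=<0|1>".  The "panic="
             * prefix must be checked before the bare "panic".
             */
            if (!strncmp(str, "panic=", 6))
                    hardlockup_panic = !!simple_strtoul(str + 6, NULL, 0);
            else if (!strncmp(str, "nopanic", 7))
                    hardlockup_panic = 0;
            else if (!strncmp(str, "panic", 5))
                    hardlockup_panic = 1;
            else if (!strncmp(str, "0", 1))
                    watchdog_enabled = 0;
            return 1;
    }
    __setup("nmi_watchdog=", hardlockup_panic_setup);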
Personally I was wondering if there were situations where you would _not_
want it to panic. If the cpu is stuck spinning after 60 seconds, the odds
of it freeing itself are low and you are probably stuck rebooting anyway.
>
> > static int __init hardlockup_panic_setup(char *str)
> > {
> > if (!strncmp(str, "panic", 5))
> > hardlockup_panic = 1;
> > + else if (!strncmp(str, "nopanic", 5))
>
> s/5/7/
doh.
I can send a refreshed patch with the above changes.
Cheers,
Don
> On Thu, Mar 17, 2011 at 06:50:13PM -0700, Andrew Morton wrote:
> > On Mon, 7 Mar 2011 16:37:39 -0500 Don Zickus <dzi...@redhat.com> wrote:
> >
> > > Add a Kconfig option to allow users to set the hardlockup to panic
> > > by default. Also add in a 'nmi_watchdog=nopanic' to override this.
> > >
> >
> > Changelog forgot to tell us "why".
>
> Yeah, sorry about that.
>
> When a cpu is considered stuck, instead of limping along and just printing
> a warning, it is sometimes preferred to just panic, let kdump capture the
> vmcore and reboot. This gets the machine back into a stable state quickly
> while saving the info that got it into a stuck state to begin with.
Ah, makes sense, thanks. I updated the changelog.
>
> >
> > > Format: [state][,regs][,debounce][,die]
> > >
> > > nmi_watchdog= [KNL,BUGS=X86] Debugging features for SMP kernels
> > > - Format: [panic,][num]
> > > + Format: [panic,][nopanic,][num]
> >
> > It would be better to support panic=[0|1], if that can be simply done
> > in a back-compatible fashion.
>
> I am open to the idea, just can't figure the best way to implement that in
> a backwards compatible way.
It's not worth busting a gut over ;)
> Personally I was wondering if there were situations where you would _not_
> want it to panic. If the cpu is stuck spinning after 60 seconds, the odds
> of it freeing itself are low and you are probably stuck rebooting anyway.
>
>
> >
> > > static int __init hardlockup_panic_setup(char *str)
> > > {
> > > if (!strncmp(str, "panic", 5))
> > > hardlockup_panic = 1;
> > > + else if (!strncmp(str, "nopanic", 5))
> >
> > s/5/7/
>
> doh.
>
> I can send a refreshed patch with the above changes.
I fixed that up.
Thanks!
Cheers,
Don