[PATCH 0/3 v2] new nmi_watchdog using perf events

Don Zickus

Feb 5, 2010, 9:50:02 PM

This patch series tries to implement a new nmi_watchdog using the perf
events subsystem. I am posting this series early for feedback on the
approach. It is not yet feature-compatible with the old nmi_watchdog, nor
does it have all of the old workarounds.

The basic design is to create an in-kernel performance counter that goes off
every few seconds and checks for cpu lockups. It is fairly
straightforward. Most of the quirks are in making sure the cpu lockup
detection works correctly.

It has been lightly tested for now. Once people are ok with the approach,
I'll expand testing to more machines in our lab.

I tried taking a generic approach so all arches could use it if they want
and then implement some per-arch hooks. I believe this is what
Ingo was suggesting. The initial work is based on kernel/softlockup.c.

Any feedback would be great.

v2:
- moved a notify_die call into a #ifdef block
- used better default values for configuring the nmi_watchdog based on
the old nmi_watchdog values

Cheers,
Don

--
damn it forgot to cc lkml

Don Zickus (3):
[RFC][x86] move notify_die from nmi.c to traps.c
[RFC] nmi_watchdog: new implementation using perf events
[RFC] nmi_watchdog: config option to enable new nmi_watchdog

arch/x86/kernel/apic/Makefile | 7 ++-
arch/x86/kernel/apic/hw_nmi.c | 114 ++++++++++++++++++++++++
arch/x86/kernel/apic/nmi.c | 7 --
arch/x86/kernel/traps.c | 7 ++
include/linux/nmi.h | 4 +
kernel/Makefile | 1 +
kernel/nmi_watchdog.c | 196 +++++++++++++++++++++++++++++++++++++++++
lib/Kconfig.debug | 13 +++
8 files changed, 341 insertions(+), 8 deletions(-)
create mode 100644 arch/x86/kernel/apic/hw_nmi.c
create mode 100644 kernel/nmi_watchdog.c

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Don Zickus

Feb 5, 2010, 9:50:01 PM

In order to handle a new nmi_watchdog approach, I need to move the notify_die()
routine out of nmi_watchdog_tick() and into default_do_nmi(). This lets me easily
swap out the old nmi_watchdog with the new one with just a config change.

The change probably makes sense from a high level perspective because the
nmi_watchdog shouldn't be handling notify_die routines anyway. However, this
move does change the semantics a little bit. Instead of checking whether the
cpus are stuck on every nmi interrupt, only check them on the nmi_watchdog interrupts.

v2: move notify_die call into #ifdef block

Signed-off-by: Don Zickus <dzi...@redhat.com>
---
arch/x86/kernel/apic/nmi.c | 7 -------
arch/x86/kernel/traps.c | 5 +++++
2 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/apic/nmi.c b/arch/x86/kernel/apic/nmi.c
index 0159a69..5d47682 100644
--- a/arch/x86/kernel/apic/nmi.c
+++ b/arch/x86/kernel/apic/nmi.c
@@ -400,13 +400,6 @@ nmi_watchdog_tick(struct pt_regs *regs, unsigned reason)
int cpu = smp_processor_id();
int rc = 0;

- /* check for other users first */
- if (notify_die(DIE_NMI, "nmi", regs, reason, 2, SIGINT)
- == NOTIFY_STOP) {
- rc = 1;
- touched = 1;
- }
-
sum = get_timer_irqs(cpu);

if (__get_cpu_var(nmi_touch)) {
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 3339917..3be4687 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -400,7 +400,12 @@ static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
if (notify_die(DIE_NMI_IPI, "nmi_ipi", regs, reason, 2, SIGINT)
== NOTIFY_STOP)
return;
+
#ifdef CONFIG_X86_LOCAL_APIC
+ if (notify_die(DIE_NMI, "nmi", regs, reason, 2, SIGINT)
+ == NOTIFY_STOP)
+ return;
+
/*
* Ok, so this is none of the documented NMI sources,
* so it must be the NMI watchdog.
--
1.6.6.83.gc9a2

Don Zickus

Feb 5, 2010, 9:50:02 PM

This is a new generic nmi_watchdog implementation using the perf events
infrastructure as suggested by Ingo.

The implementation is simple: just create an in-kernel perf event and register
an overflow handler to check for cpu lockups. I created a generic implementation
that lives in kernel/ and a hardware-specific part that for now lives in arch/x86.

I have done light testing to make sure the framework works correctly and it does.

v2: sets the correct timeout values based on the old nmi watchdog

Signed-off-by: Don Zickus <dzi...@redhat.com>
---

arch/x86/kernel/apic/hw_nmi.c | 114 ++++++++++++++++++++++++
kernel/nmi_watchdog.c | 191 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 305 insertions(+), 0 deletions(-)

create mode 100644 arch/x86/kernel/apic/hw_nmi.c
create mode 100644 kernel/nmi_watchdog.c

diff --git a/arch/x86/kernel/apic/hw_nmi.c b/arch/x86/kernel/apic/hw_nmi.c
new file mode 100644
index 0000000..8c0e6a4
--- /dev/null
+++ b/arch/x86/kernel/apic/hw_nmi.c
@@ -0,0 +1,114 @@
+/*
+ * HW NMI watchdog support
+ *
+ * started by Don Zickus, Copyright (C) 2010 Red Hat, Inc.
+ *
+ * Arch specific calls to support NMI watchdog
+ *
+ * Bits copied from original nmi.c file
+ *
+ */
+
+#include <asm/apic.h>
+#include <linux/smp.h>
+#include <linux/cpumask.h>
+#include <linux/sched.h>
+#include <linux/percpu.h>
+#include <linux/cpumask.h>
+#include <linux/kernel_stat.h>
+#include <asm/mce.h>
+
+#include <linux/nmi.h>
+#include <linux/module.h>
+
+/* For reliability, we're prepared to waste bits here. */
+static DECLARE_BITMAP(backtrace_mask, NR_CPUS) __read_mostly;
+
+static DEFINE_PER_CPU(unsigned, last_irq_sum);
+
+/*
+ * Take the local apic timer and PIT/HPET into account. We don't
+ * know which one is active, when we have highres/dyntick on
+ */
+static inline unsigned int get_timer_irqs(int cpu)
+{
+ return per_cpu(irq_stat, cpu).apic_timer_irqs +
+ per_cpu(irq_stat, cpu).irq0_irqs;
+}
+
+static inline int mce_in_progress(void)
+{
+#if defined(CONFIG_X86_MCE)
+ return atomic_read(&mce_entry) > 0;
+#endif
+ return 0;
+}
+
+int hw_nmi_is_cpu_stuck(struct pt_regs *regs)
+{
+ unsigned int sum;
+ int cpu = smp_processor_id();
+
+ /* FIXME: cheap hack for this check, probably should get its own
+ * die_notifier handler
+ */
+ if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
+ static DEFINE_SPINLOCK(lock); /* Serialise the printks */
+
+ spin_lock(&lock);
+ printk(KERN_WARNING "NMI backtrace for cpu %d\n", cpu);
+ show_regs(regs);
+ dump_stack();
+ spin_unlock(&lock);
+ cpumask_clear_cpu(cpu, to_cpumask(backtrace_mask));
+ }
+
+ /* if we are doing an mce, just assume the cpu is not stuck */
+ /* Could check oops_in_progress here too, but it's safer not to */
+ if (mce_in_progress())
+ return 0;
+
+ /* We determine if the cpu is stuck by checking whether any
+ * interrupts have happened since we last checked. Of course
+ * an nmi storm could create false positives, but the higher
+ * level logic should account for that
+ */
+ sum = get_timer_irqs(cpu);
+ if (__get_cpu_var(last_irq_sum) == sum) {
+ return 1;
+ } else {
+ __get_cpu_var(last_irq_sum) = sum;
+ return 0;
+ }
+}
+
+void arch_trigger_all_cpu_backtrace(void)
+{
+ int i;
+
+ cpumask_copy(to_cpumask(backtrace_mask), cpu_online_mask);
+
+ printk(KERN_INFO "sending NMI to all CPUs:\n");
+ apic->send_IPI_all(NMI_VECTOR);
+
+ /* Wait for up to 10 seconds for all CPUs to do the backtrace */
+ for (i = 0; i < 10 * 1000; i++) {
+ if (cpumask_empty(to_cpumask(backtrace_mask)))
+ break;
+ mdelay(1);
+ }
+}
+
+/* STUB calls to mimic old nmi_watchdog behaviour */
+unsigned int nmi_watchdog = NMI_NONE;
+EXPORT_SYMBOL(nmi_watchdog);
+atomic_t nmi_active = ATOMIC_INIT(0); /* oprofile uses this */
+EXPORT_SYMBOL(nmi_active);
+int nmi_watchdog_enabled;
+int unknown_nmi_panic;
+void cpu_nmi_set_wd_enabled(void) { return; }
+void acpi_nmi_enable(void) { return; }
+void acpi_nmi_disable(void) { return; }
+void stop_apic_nmi_watchdog(void *unused) { return; }
+void setup_apic_nmi_watchdog(void *unused) { return; }
+int __init check_nmi_watchdog(void) { return 0; }
diff --git a/kernel/nmi_watchdog.c b/kernel/nmi_watchdog.c
new file mode 100644
index 0000000..36817b2
--- /dev/null
+++ b/kernel/nmi_watchdog.c
@@ -0,0 +1,191 @@
+/*
+ * Detect Hard Lockups using the NMI
+ *
+ * started by Don Zickus, Copyright (C) 2010 Red Hat, Inc.
+ *
+ * this code detects hard lockups: incidents where, on a CPU,
+ * the kernel does not respond to anything except NMI.
+ *
+ * Note: Most of this code is borrowed heavily from softlockup.c,
+ * so thanks to Ingo for the initial implementation.
+ * Some chunks also taken from arch/x86/kernel/apic/nmi.c, thanks
+ * to those contributors as well.
+ */
+
+#include <linux/mm.h>
+#include <linux/cpu.h>
+#include <linux/nmi.h>
+#include <linux/init.h>
+#include <linux/delay.h>
+#include <linux/freezer.h>
+#include <linux/lockdep.h>
+#include <linux/notifier.h>
+#include <linux/module.h>
+#include <linux/sysctl.h>
+
+#include <asm/irq_regs.h>
+#include <linux/perf_event.h>
+
+static DEFINE_PER_CPU(struct perf_event *, nmi_watchdog_ev);
+static DEFINE_PER_CPU(int, nmi_watchdog_touch);
+static DEFINE_PER_CPU(long, alert_counter);
+
+void touch_nmi_watchdog(void)
+{
+ __raw_get_cpu_var(nmi_watchdog_touch) = 1;
+ touch_softlockup_watchdog();
+}
+EXPORT_SYMBOL(touch_nmi_watchdog);
+
+void touch_all_nmi_watchdog(void)
+{
+ int cpu;
+
+ for_each_online_cpu(cpu)
+ per_cpu(nmi_watchdog_touch, cpu) = 1;
+ touch_softlockup_watchdog();
+}
+
+#ifdef CONFIG_SYSCTL
+/*
+ * proc handler for /proc/sys/kernel/nmi_watchdog
+ */
+int proc_nmi_enabled(struct ctl_table *table, int write,
+ void __user *buffer, size_t *length, loff_t *ppos)
+{
+ int cpu;
+
+ if (per_cpu(nmi_watchdog_ev, smp_processor_id()) == NULL)
+ nmi_watchdog_enabled = 0;
+ else
+ nmi_watchdog_enabled = 1;
+
+ touch_all_nmi_watchdog();
+ proc_dointvec(table, write, buffer, length, ppos);
+ if (nmi_watchdog_enabled)
+ for_each_online_cpu(cpu)
+ perf_event_enable(per_cpu(nmi_watchdog_ev, cpu));
+ else
+ for_each_online_cpu(cpu)
+ perf_event_disable(per_cpu(nmi_watchdog_ev, cpu));
+ return 0;
+}
+
+#endif /* CONFIG_SYSCTL */
+
+struct perf_event_attr wd_attr = {
+ .type = PERF_TYPE_HARDWARE,
+ .config = PERF_COUNT_HW_CPU_CYCLES,
+ .size = sizeof(struct perf_event_attr),
+ .pinned = 1,
+ .disabled = 1,
+};
+
+static int panic_on_timeout;
+
+void wd_overflow(struct perf_event *event, int nmi,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
+{
+ int cpu = smp_processor_id();
+ int touched = 0;
+
+ if (__get_cpu_var(nmi_watchdog_touch)) {
+ per_cpu(nmi_watchdog_touch, cpu) = 0;
+ touched = 1;
+ }
+
+ /* check to see if the cpu is doing anything */
+ if (!touched && hw_nmi_is_cpu_stuck(regs)) {
+ /*
+ * Ayiee, looks like this CPU is stuck ...
+ * wait a few IRQs (5 seconds) before doing the oops ...
+ */
+ per_cpu(alert_counter,cpu) += 1;
+ if (per_cpu(alert_counter,cpu) == 5) {
+ /*
+ * die_nmi will return ONLY if NOTIFY_STOP happens..
+ */
+ die_nmi("BUG: NMI Watchdog detected LOCKUP",
+ regs, panic_on_timeout);
+ }
+ } else {
+ per_cpu(alert_counter,cpu) = 0;
+ }
+
+ return;
+}
+
+/*
+ * Create/destroy watchdog perf events as CPUs come and go:
+ */
+static int __cpuinit
+cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+ int hotcpu = (unsigned long)hcpu;
+ struct perf_event *event;
+
+ switch (action) {
+ case CPU_UP_PREPARE:
+ case CPU_UP_PREPARE_FROZEN:
+ per_cpu(nmi_watchdog_touch, hotcpu) = 0;
+ break;
+ case CPU_ONLINE:
+ case CPU_ONLINE_FROZEN:
+ /* originally wanted the below chunk to be in CPU_UP_PREPARE, but caps is unpriv for non-CPU0 */
+ wd_attr.sample_period = cpu_khz * 1000;
+ event = perf_event_create_kernel_counter(&wd_attr, hotcpu, -1, wd_overflow);
+ if (IS_ERR(event)) {
+ printk(KERN_ERR "nmi watchdog failed to create perf event on %i: %p\n", hotcpu, event);
+ return NOTIFY_BAD;
+ }
+ per_cpu(nmi_watchdog_ev, hotcpu) = event;
+ perf_event_enable(per_cpu(nmi_watchdog_ev, hotcpu));
+ break;
+#ifdef CONFIG_HOTPLUG_CPU
+ case CPU_UP_CANCELED:
+ case CPU_UP_CANCELED_FROZEN:
+ perf_event_disable(per_cpu(nmi_watchdog_ev, hotcpu));
+ case CPU_DEAD:
+ case CPU_DEAD_FROZEN:
+ event = per_cpu(nmi_watchdog_ev, hotcpu);
+ per_cpu(nmi_watchdog_ev, hotcpu) = NULL;
+ perf_event_release_kernel(event);
+ break;
+#endif /* CONFIG_HOTPLUG_CPU */
+ }
+ return NOTIFY_OK;
+}
+
+static struct notifier_block __cpuinitdata cpu_nfb = {
+ .notifier_call = cpu_callback
+};
+
+static int __initdata nonmi_watchdog;
+
+static int __init nonmi_watchdog_setup(char *str)
+{
+ nonmi_watchdog = 1;
+ return 1;
+}
+__setup("nonmi_watchdog", nonmi_watchdog_setup);
+
+static int __init spawn_nmi_watchdog_task(void)
+{
+ void *cpu = (void *)(long)smp_processor_id();
+ int err;
+
+ if (nonmi_watchdog)
+ return 0;
+
+ err = cpu_callback(&cpu_nfb, CPU_UP_PREPARE, cpu);
+ if (err == NOTIFY_BAD) {
+ BUG();
+ return 1;
+ }
+ cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
+ register_cpu_notifier(&cpu_nfb);
+
+ return 0;
+}
+early_initcall(spawn_nmi_watchdog_task);
--
1.6.6.83.gc9a2

tip-bot for Don Zickus

Feb 8, 2010, 4:00:02 AM

Commit-ID: 1fb9d6ad2766a1dd70d167552988375049a97f21
Gitweb: http://git.kernel.org/tip/1fb9d6ad2766a1dd70d167552988375049a97f21
Author: Don Zickus <dzi...@redhat.com>
AuthorDate: Fri, 5 Feb 2010 21:47:04 -0500
Committer: Ingo Molnar <mi...@elte.hu>
CommitDate: Mon, 8 Feb 2010 08:29:02 +0100

nmi_watchdog: Add new, generic implementation, using perf events

This is a new generic nmi_watchdog implementation using the perf
events infrastructure as suggested by Ingo.

The implementation is simple: just create an in-kernel perf
event and register an overflow handler to check for cpu lockups.

I created a generic implementation that lives in kernel/ and
a hardware-specific part that for now lives in arch/x86.

This approach has a number of advantages:

- It simplifies the x86 PMU implementation in the long run,
in that it removes the hardcoded low-level PMU implementation
that was the NMI watchdog before.

- It allows new NMI watchdog features to be added in a central
place.

- It allows other architectures to enable the NMI watchdog,
as long as they have perf events (that provide NMIs)
implemented.

- It also allows for more graceful co-existence of existing
perf events apps and the NMI watchdog - before these changes
the relationship was exclusive. (The NMI watchdog will 'spend'
a perf event when enabled. In later iterations we might be
able to piggyback from an existing NMI event without having
to allocate a hardware event for the NMI watchdog - turning
this into a no-hardware-cost feature.)

As for compatibility, we'll keep the old NMI watchdog code as
well until the new one can 100% replace it on all CPUs, old and
new alike. That might take some time as the NMI watchdog has
been ported to many CPU models.

I have done light testing to make sure the framework works
correctly and it does.

v2: Set the correct timeout values based on the old nmi
watchdog

Signed-off-by: Don Zickus <dzi...@redhat.com>
Cc: Linus Torvalds <torv...@linux-foundation.org>
Cc: Andrew Morton <ak...@linux-foundation.org>
Cc: gorc...@gmail.com
Cc: ar...@redhat.com
Cc: pet...@infradead.org
LKML-Reference: <1265424425-31562-3-g...@redhat.com>
Signed-off-by: Ingo Molnar <mi...@elte.hu>

---
arch/x86/kernel/apic/hw_nmi.c | 114 ++++++++++++++++++++++++
kernel/nmi_watchdog.c | 191 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 305 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/apic/hw_nmi.c b/arch/x86/kernel/apic/hw_nmi.c

tip-bot for Don Zickus

Feb 8, 2010, 4:00:02 AM

Commit-ID: e40b17208b6805be50ffe891878662b6076206b9
Gitweb: http://git.kernel.org/tip/e40b17208b6805be50ffe891878662b6076206b9
Author: Don Zickus <dzi...@redhat.com>
AuthorDate: Fri, 5 Feb 2010 21:47:03 -0500

Committer: Ingo Molnar <mi...@elte.hu>
CommitDate: Mon, 8 Feb 2010 08:29:02 +0100

x86: Move notify_die from nmi.c to traps.c

In order to handle a new nmi_watchdog approach, I need to move
the notify_die() routine out of nmi_watchdog_tick() and into
default_do_nmi(). This lets me easily swap out the old
nmi_watchdog with the new one with just a config change.

The change probably makes sense from a high level perspective
because the nmi_watchdog shouldn't be handling notify_die
routines anyway. However, this move does change the semantics a
little bit. Instead of checking on every nmi interrupt if the
cpus are stuck, only check them on the nmi_watchdog interrupts.

v2: Move notify_die call into #ifdef block

Signed-off-by: Don Zickus <dzi...@redhat.com>

Cc: Linus Torvalds <torv...@linux-foundation.org>
Cc: Andrew Morton <ak...@linux-foundation.org>
Cc: gorc...@gmail.com
Cc: ar...@redhat.com
Cc: pet...@infradead.org

LKML-Reference: <1265424425-31562-2-g...@redhat.com>
Signed-off-by: Ingo Molnar <mi...@elte.hu>

---
arch/x86/kernel/apic/nmi.c | 7 -------
arch/x86/kernel/traps.c | 5 +++++
2 files changed, 5 insertions(+), 7 deletions(-)

index 1168e44..51ef893 100644

--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -400,7 +400,12 @@ static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
if (notify_die(DIE_NMI_IPI, "nmi_ipi", regs, reason, 2, SIGINT)
== NOTIFY_STOP)
return;
+
#ifdef CONFIG_X86_LOCAL_APIC
+ if (notify_die(DIE_NMI, "nmi", regs, reason, 2, SIGINT)
+ == NOTIFY_STOP)
+ return;
+
/*
* Ok, so this is none of the documented NMI sources,
* so it must be the NMI watchdog.
--
