x86/mce: machine check warning during poweroff

Djalal Harouni

unread,

Jan 10, 2012, 7:00:51 PM1/10/12

to Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org

Today's pull from Linus' tree shows a warning during poweroff, the
message is related to the machinecheck.
The drivers/base/core.c:device_release() did not find the registred
release() function.

This kernel is used for development and it's running under KVM/Qemu, so
if you need further information or tests let me know.

Qemu is simulating 2 CPUs.

Thanks.

[ 1879.944193] ------------[ cut here ]------------
[ 1879.950488] WARNING: at drivers/base/core.c:194 device_release+0x82/0x90()
[ 1879.959424] Hardware name: Bochs
[ 1879.964714] Device 'machinecheck1' does not have a release() function, it is broken and must be fixed.
[ 1879.977354] Modules linked in:
[ 1879.979704] Pid: 1738, comm: halt Not tainted 3.2.0-minimal-kvm-05692-g1c81065-dirty #41
[ 1879.989093] Call Trace:
[ 1879.992729] [<ffffffff8103952a>] warn_slowpath_common+0x7a/0xb0
[ 1879.999308] [<ffffffff81039601>] warn_slowpath_fmt+0x41/0x50
[ 1880.005463] [<ffffffff8172b022>] device_release+0x82/0x90
[ 1880.012915] [<ffffffff81601667>] kobject_release+0x47/0x90
[ 1880.019107] [<ffffffff8160152c>] kobject_put+0x2c/0x60
[ 1880.024269] [<ffffffff8172acc2>] put_device+0x12/0x20
[ 1880.031254] [<ffffffff8172ba19>] device_unregister+0x19/0x20
[ 1880.038594] [<ffffffff81afb49d>] mce_cpu_callback+0xea/0x18b
[ 1880.043389] [<ffffffff81b08924>] notifier_call_chain+0x64/0xf0
[ 1880.051928] [<ffffffff81066c89>] __raw_notifier_call_chain+0x9/0x10
[ 1880.059077] [<ffffffff8103b50b>] __cpu_notify+0x1b/0x30
[ 1880.063894] [<ffffffff8103b530>] cpu_notify_nofail+0x10/0x20
[ 1880.071952] [<ffffffff81ae27dd>] _cpu_down+0x11d/0x2c0
[ 1880.078534] [<ffffffff81b01235>] ? printk+0x3c/0x3e
[ 1880.082662] [<ffffffff8103b7cb>] disable_nonboot_cpus+0x8b/0x110
[ 1880.091129] [<ffffffff81053f21>] kernel_power_off+0x21/0x50
[ 1880.098420] [<ffffffff81054220>] sys_reboot+0x110/0x220
[ 1880.104098] [<ffffffff8108efdd>] ? trace_hardirqs_on+0xd/0x10
[ 1880.112006] [<ffffffff81b04deb>] ? _raw_spin_unlock_irq+0x2b/0x50
[ 1880.119181] [<ffffffff8106dc0d>] ? finish_task_switch+0x8d/0x1a0
[ 1880.126741] [<ffffffff8106dbce>] ? finish_task_switch+0x4e/0x1a0
[ 1880.134793] [<ffffffff81b02f0b>] ? __schedule+0x3db/0x890
[ 1880.140510] [<ffffffff81b0cfc7>] ? sysret_check+0x1b/0x56
[ 1880.148101] [<ffffffff8160d33e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 1880.156706] [<ffffffff81b0cfa2>] system_call_fastpath+0x16/0x1b
[ 1880.162885] ---[ end trace d8faf9d3af9f23e8 ]---
[ 1880.171148] Power down.

$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 2
model name : QEMU Virtual CPU version 0.12.5
stepping : 3
microcode : 0x1
cpu MHz : 2800.047
cache size : 4096 KB
fpu : yes
fpu_exception : yes
cpuid level : 4
wp : yes
flags : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pse36 clflush mmx fxsr sse sse2 syscall nx lm rep_good nopl pni cx16
tsc_deadlin _timer hypervisor lahf_lm
bogomips : 5600.09
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

--
tixxdz
http://opendz.org
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Ming Lei

unread,

Jan 12, 2012, 9:22:25 AM1/12/12

to Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org

Hi,

I saw the warning too during S2R.

thanks,
--
Ming Lei

--
Ming Lei

Srivatsa S. Bhat

unread,

Jan 13, 2012, 3:22:54 PM1/13/12

to Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, Greg Kroah-Hartman, Linus Torvalds, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Jeff Chua

On 01/12/2012 07:52 PM, Ming Lei wrote:

> Hi,
>
> I saw the warning too during S2R.
>

>

Fundamentally, this warning is triggered during CPU Offline, which is done
during poweroff, suspend, hibernate etc. IOW, even a simple
# echo 0 > /sys/devices/system/cpu/cpuX/online will trigger it.

Some discussion about this warning and a probable fix is going on in this
thread: https://lkml.org/lkml/2012/1/13/278

[And there have been reports of Suspend/Hibernate not working in recent
kernels (3.3 merge window)]

However, it is to be noted that, technically this warning (machinecheck1
not having a release() function) is not all that new. Just that people
didn't probably notice it earlier (reason explained below).

Prior to the 3.3 merge window (when everything was fine, particularly
suspend/resume), upon a CPU offline, we used to get the following message:

Broke affinity for irq 49
Broke affinity for irq 87
CPU 1 is now offline
kobject:kobject: 'index0' (ffff8802764e5c00): does not have a release() function, it is broken and must be fixed.
kobject:kobject: 'index1' (ffff8802764e5c48): does not have a release() function, it is broken and must be fixed.
kobject:kobject: 'index2' (ffff8802764e5c90): does not have a release() function, it is broken and must be fixed.
kobject:kobject: 'index3' (ffff8802764e5cd8): does not have a release() function, it is broken and must be fixed.
kobject:kobject: 'cache' (ffff88027926c480): does not have a release() function, it is broken and must be fixed.
kobject:kobject: 'machinecheck1' (ffff88002822d8f0): does not have a release() function, it is broken and must be fixed.
^^^^^^^^^
This is from the kobject_cleanup() function defined in lib/kobject.c. Since
pr_debug() was used for printing, it made this kind of obscure.

After commit 8a25a2fd (cpu: convert 'cpu' and 'machinecheck' sysdev_class to
a regular subsystem), the callpaths changed and we now hit the rather strong
looking WARN() in drivers/base/core.c:device_release(), which is why it is
getting everyone's attention now.

So, in the recent kernels (3.3 merge window), we get:

(Note the difference in the kobject line about machinecheck)

[46407.738415] kobject: 'cpufreq' (ffff88026f794098): calling ktype release
[46407.752649] CPU 1 is now offline
[46407.757002] kobject: 'index0' (ffff88026f0cac00): does not have a release() function, it is broken and must be fixed.
[46407.769302] kobject: 'index1' (ffff88026f0cac48): does not have a release() function, it is broken and must be fixed.
[46407.781412] kobject: 'index2' (ffff88026f0cac90): does not have a release() function, it is broken and must be fixed.
[46407.793480] kobject: 'index3' (ffff88026f0cacd8): does not have a release() function, it is broken and must be fixed.
[46407.805547] kobject: 'cache' (ffff880272e0d3c0): does not have a release() function, it is broken and must be fixed.
[46407.817906] kobject: 'machinecheck1' (ffff88027fc2cb70): calling ktype release
[46407.826182] ------------[ cut here ]------------
[46407.831514] WARNING: at drivers/base/core.c:194 device_release+0x82/0x90()
[46407.831515] Hardware name: IBM System X iDataPlex dx360 M4 Server -[7912AC1]-
[46407.831517] Device 'machinecheck1' does not have a release() function, it is broken and must be fixed.

IOW, the warning about machinecheck has just been moved from one place to
another.

My only point here is that we have essentially seen this warning before
when suspend/resume was working fine. And it has been reported that
suspend/resume works fine if CONFIG_X86_MCE is not set. So I guess something
else is wrong somewhere.. IOW, I feel whether or not machinecheck has a
release function doesn't really matter that much for suspend/resume to get
any better.

Regards,
Srivatsa S. Bhat
IBM Linux Technology Center

Justin P. Mattock

unread,

Jan 13, 2012, 3:34:27 PM1/13/12

to Srivatsa S. Bhat, Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, Greg Kroah-Hartman, Linus Torvalds, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, Jeff Chua

well I dont care much for the message since its a warning message(should
be fixed though), its when the machine froze. maybe I hit something else
other than this warning. I can try doing some more suspending to see if
this freeze shows up and try to capture syslog or image then post it.

Justin P. Mattock

Linus Torvalds

unread,

Jan 13, 2012, 3:37:53 PM1/13/12

to Srivatsa S. Bhat, Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, Greg Kroah-Hartman, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Jeff Chua

On Fri, Jan 13, 2012 at 12:22 PM, Srivatsa S. Bhat
<srivat...@linux.vnet.ibm.com> wrote:
> However, it is to be noted that, technically this warning (machinecheck1
> not having a release() function) is not all that new. Just that people
> didn't probably notice it earlier (reason explained below).

.. but there really is some semantic issue going on here. I'm not the
only one for whom suspend/resume simply does not work with MCE
enabled, and that "not work" coincides with the switch-over from
sysdev to struct device.

I suspect it is some subtle difference in sysdev_unregister() vs
device_unregister(). The message change is just a symptom of the fact
that we go through different paths.

For me MCE hangs only on the *second* suspend, but it sounds like
others see it on the first one.

Linus

Srivatsa S. Bhat

unread,

Jan 13, 2012, 3:53:30 PM1/13/12

to Linus Torvalds, Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, Greg Kroah-Hartman, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Jeff Chua

On 01/14/2012 02:07 AM, Linus Torvalds wrote:

> On Fri, Jan 13, 2012 at 12:22 PM, Srivatsa S. Bhat
> <srivat...@linux.vnet.ibm.com> wrote:
>> However, it is to be noted that, technically this warning (machinecheck1
>> not having a release() function) is not all that new. Just that people
>> didn't probably notice it earlier (reason explained below).
>
> .. but there really is some semantic issue going on here. I'm not the
> only one for whom suspend/resume simply does not work with MCE
> enabled, and that "not work" coincides with the switch-over from
> sysdev to struct device.
>

True. Even Marcos Souza reported that suspend/resume doesn't work with
MCE enabled.

> I suspect it is some subtle difference in sysdev_unregister() vs
> device_unregister(). The message change is just a symptom of the fact
> that we go through different paths.
>
> For me MCE hangs only on the *second* suspend, but it sounds like
> others see it on the first one.
>

Wait a minute, did you mention "second attempt"? I think I have something
interesting..

I was trying out the pm_test framework (Documentation/power/basic-pm-
debugging.txt) at the processors level (in which CPUs are offlined).
First attempt succeeded. Second attempt was reproducibly problematic.
And this was with MCE enabled.

Here is the log (covers both first and second attempts at processors level):

[ 71.427970] PM: Hibernation mode set to 'reboot'

[ 109.588229] PM: Marking nosave pages: 000000000009c000 - 0000000000100000

[ 109.595195] PM: Marking nosave pages: 000000007ce5e000 - 000000007cf22000

[ 109.602154] PM: Marking nosave pages: 000000007d6ae000 - 000000007d75e000

[ 109.609111] PM: Marking nosave pages: 000000007f68f000 - 000000007f7ff000

[ 109.616075] PM: Marking nosave pages: 000000007f800000 - 0000000100000000

[ 109.626743] PM: Basic memory bitmaps created

[ 109.631179] PM: Syncing filesystems ... done.

[ 109.662279] Freezing user space processes ... (elapsed 0.01 seconds) done.

[ 109.687212] PM: Preallocating image memory... done (allocated 506656 pages)

[ 112.056392] PM: Allocated 2026624 kbytes in 2.36 seconds (858.73 MB/s)

[ 112.063046] Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.

8.796863] ioatdma 0000:00:16.6: setting latency timer to 64

[ 118.796866] uhci_hcd 0000:00:1a.0: setting latency timer to 64

[ 118.796870] ioatdma 0000:00:16.7: setting latency timer to 64

[ 118.796891] ehci_hcd 0000:00:1a.7: setting latency timer to 64

[ 118.796916] uhci_hcd 0000:00:1d.0: setting latency timer to 64

[ 118.796932] usb usb3: root hub lost power or was reset

[ 118.796952] usb usb4: root hub lost power or was reset

[ 118.796954] uhci_hcd 0000:00:1d.1: setting latency timer to 64

[ 118.796975] uhci_hcd 0000:00:1d.2: setting latency timer to 64

[ 118.796988] usb usb5: root hub lost power or was reset

[ 118.797011] ehci_hcd 0000:00:1d.7: setting latency timer to 64

[ 118.797014] usb usb6: root hub lost power or was reset

[ 118.797035] pci 0000:00:1e.0: setting latency timer to 64

[ 118.797055] mptbase: ioc0: pci-resume: pdev=0xffff8808ca39b000, slot=0000:0b:00.0, Previous operating state [D0]

[ 118.797069] mptsas 0000:0b:00.0: PCI INT A -> GSI 28 (level, low) -> IRQ 28

[ 118.797086] usb usb1: root hub lost power or was reset

[ 118.797132] usb usb2: root hub lost power or was reset

[ 118.797155] mptbase: ioc0: pci-resume: ioc-state=0x1,doorbell=0x10000000

[ 118.797487] pci 0000:06:00.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20

[ 118.797498] pci 0000:06:00.0: setting latency timer to

SOL session is temporarily not available.

Attempting to re-establish your session ...

SOL connection is re-established. SOL session is now active.

[ 135.087886] mptbase: ioc0: pci-resume: success

[ 135.093612] Extended CMOS year: 2000

[ 135.097667] PM: restore of devices complete after 16301.354 msecs

[ 135.105915] PM: Image restored successfully.

[ 135.110467] Restarting tasks ... done.

[ 135.116548] PM: Basic memory bitmaps freed

[ 169.935234] PM: Marking nosave pages: 000000000009c000 - 0000000000100000

[ 169.942263] PM: Marking nosave pages: 000000007ce5e000 - 000000007cf22000

[ 169.949252] PM: Marking nosave pages: 000000007d6ae000 - 000000007d75e000

[ 169.956228] PM: Marking nosave pages: 000000007f68f000 - 000000007f7ff000

[ 169.963215] PM: Marking nosave pages: 000000007f800000 - 0000000100000000

[ 169.973855] PM: Basic memory bitmaps created

[ 169.978314] PM: Syncing filesystems ... done.

[ 170.045487] Freezing user space processes ... (elapsed 0.01 seconds) done.

[ 170.067268] PM: Preallocating image memory... done (allocated 505915 pages)

[ 172.421628] PM: Allocated 2023660 kbytes in 2.34 seconds (864.81 MB/s)

[ 172.428334] Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.

[ 172.453631] ------------[ cut here ]------------

[ 172.456007] WARNING: at kernel/lockdep.c:956 __bfs+0x1f5/0x270()

[ 172.456007] Hardware name: IBM System x -[7870C4Q]-

[ 172.456007] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf microcode fuse loop dm_mod shpchp i7core_edac cdc_ether i2c_i801 ioatdma bnx2 usbnet iTCO_wdt edac_core pci_hotplug dca i2c_core tpm_tis pcspkr serio_raw tpm mii sg rtc_cmos button iTCO_vendor_support tpm_bios uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon

[ 172.456007] Pid: 6166, comm: bash Tainted: G W 3.2.0-33-default #3

[ 172.456007] Call Trace:

[ 172.456007] [<ffffffff81095415>] ? __bfs+0x1f5/0x270

[ 172.456007] [<ffffffff8103ceca>] warn_slowpath_common+0x7a/0xb0

[ 172.456007] [<ffffffff81492bab>] ? _raw_spin_unlock_irq+0x2b/0x50

[ 172.456007] [<ffffffff8103cf15>] warn_slowpath_null+0x15/0x20

[ 172.456007] [<ffffffff81095415>] __bfs+0x1f5/0x270

[ 172.456007] [<ffffffff81095100>] ? noop_count+0x10/0x10

[ 172.456007] [<ffffffff8109892b>] check_usage_backwards+0x6b/0xf0

[ 172.456007] [<ffffffff81496d66>] ? notifier_call_chain+0x26/0x110

[ 172.456007] [<ffffffff810988c0>] ? check_usage_forwards+0x110/0x110

[ 172.456007] [<ffffffff8109930a>] mark_lock_irq+0x9a/0x200

[ 172.456007] [<ffffffff8109957c>] mark_lock+0x10c/0x1b0

[ 172.456007] [<ffffffff81099740>] mark_irqflags+0x120/0x190

[ 172.456007] [<ffffffff8109ae60>] __lock_acquire+0x460/0x520

[ 172.456007] [<ffffffff8109b041>] lock_acquire+0x121/0x190

[ 172.456007] [<ffffffff81473fc0>] ? klist_next+0x30/0x110

[ 172.456007] [<ffffffff81099830>] ? mark_held_locks+0x80/0x120

[ 172.456007] [<ffffffff81341700>] ? device_pm_wait_for_dev+0x40/0x40

[ 172.456007] [<ffffffff814922cc>] _raw_spin_lock+0x3c/0x50

[ 172.456007] [<ffffffff81473fc0>] ? klist_next+0x30/0x110

[ 172.456007] [<ffffffff81473fc0>] klist_next+0x30/0x110

[ 172.456007] [<ffffffff81341700>] ? device_pm_wait_for_dev+0x40/0x40

[ 172.456007] [<ffffffff81336d12>] device_for_each_child+0x52/0x70

[ 172.456007] [<ffffffff81342241>] __device_suspend+0x31/0x2e0

[ 172.456007] [<ffffffff81342550>] dpm_suspend+0x60/0x260

[ 172.456007] [<ffffffff810ad107>] hibernation_snapshot+0xd7/0x1b0

[ 172.456007] [<ffffffff810ad497>] hibernate+0x137/0x260

[ 172.456007] [<ffffffff810aaea4>] state_store+0x114/0x140

[ 172.456007] [<ffffffff8126c3a7>] kobj_attr_store+0x17/0x20

[ 172.456007] [<ffffffff811e9214>] sysfs_write_file+0xd4/0x150

[ 172.456007] [<ffffffff81173aeb>] vfs_write+0xcb/0x130

[ 172.456007] [<ffffffff81173c40>] sys_write+0x50/0x90

[ 172.456007] [<ffffffff8149bc39>] system_call_fastpath+0x16/0x1b

[ 172.456007] ---[ end trace b286a5fb903f51e7 ]---

[ 172.456007] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010

[ 172.456007] IP: [<ffffffff81095309>] __bfs+0xe9/0x270

[ 172.456007] PGD 1146b62067 PUD 1147c8b067 PMD 0

[ 172.456007] Oops: 0000 [#1] SMP

[ 172.456007] CPU 2

[ 172.456007] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf microcode fuse loop dm_mod shpchp i7core_edac cdc_ether i2c_i801 ioatdma bnx2 usbnet iTCO_wdt edac_core pci_hotplug dca i2c_core tpm_tis pcspkr serio_raw tpm mii sg rtc_cmos button iTCO_vendor_support tpm_bios uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon

[ 172.456007]

[ 172.456007] Pid: 6166, comm: bash Tainted: G W 3.2.0-33-default #3 IBM IBM System x -[7870C4Q]-/68Y8033

[ 172.456007] RIP: 0010:[<ffffffff81095309>] [<ffffffff81095309>] __bfs+0xe9/0x270

[ 172.456007] RSP: 0018:ffff8808c6607ab8 EFLAGS: 00010096

[ 172.456007] RAX: 0000000000000000 RBX: ffffffff820aa330 RCX: ffff8808c73b13c0

[ 172.456007] RDX: ffff8808c73b13c0 RSI: 0000000000000001 RDI: 0000000000000009

[ 172.456007] RBP: ffff8808c6607b18 R08: ffff8808c6607ae8 R09: 0000000000000000

[ 172.456007] R10: 0000000000000000 R11: 0000000000000004 R12: 0000000000000000

[ 172.456007] R13: 0000000001f87ca6 R14: ffff8808c73b13c0 R15: 0000000000000000

[ 172.456007] FS: 00007f46e193a700(0000) GS:ffff8808ffc80000(0000) knlGS:0000000000000000

[ 172.456007] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033

[ 172.456007] CR2: 0000000000000010 CR3: 00000011476b7000 CR4: 00000000000006e0

[ 172.456007] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

[ 172.456007] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

[ 172.456007] Process bash (pid: 6166, threadinfo ffff8808c6606000, task ffff8808c73b13c0)

[ 172.456007] Stack:

[ 172.456007] ffff8808c6607b18 ffff8808c6607ae8 ffff8808c6607b70 ffffffff81095100

[ 172.456007] 0000000000000000 ffffffff816016d0 ffff8808c6607b28 ffff8808c6607b28

[ 172.456007] ffff8808c73b1cc8 ffffffff81779414 ffff8808c73b13c0 ffff8808c73b13c0

[ 172.456007] Call Trace:

[ 172.456007] [<ffffffff81095100>] ? noop_count+0x10/0x10

[ 172.456007] [<ffffffff8109892b>] check_usage_backwards+0x6b/0xf0

[ 172.456007] [<ffffffff81496d66>] ? notifier_call_chain+0x26/0x110

[ 172.456007] [<ffffffff810988c0>] ? check_usage_forwards+0x110/0x110

[ 172.456007] [<ffffffff8109930a>] mark_lock_irq+0x9a/0x200

[ 172.456007] [<ffffffff8109957c>] mark_lock+0x10c/0x1b0

[ 172.456007] [<ffffffff81099740>] mark_irqflags+0x120/0x190

[ 172.456007] [<ffffffff8109ae60>] __lock_acquire+0x460/0x520

[ 172.456007] [<ffffffff8109b041>] lock_acquire+0x121/0x190

[ 172.456007] [<ffffffff81473fc0>] ? klist_next+0x30/0x110

[ 172.456007] [<ffffffff81099830>] ? mark_held_locks+0x80/0x120

[ 172.456007] [<ffffffff81341700>] ? device_pm_wait_for_dev+0x40/0x40

[ 172.456007] [<ffffffff814922cc>] _raw_spin_lock+0x3c/0x50

[ 172.456007] [<ffffffff81473fc0>] ? klist_next+0x30/0x110

[ 172.456007] [<ffffffff81473fc0>] klist_next+0x30/0x110

[ 172.456007] [<ffffffff81341700>] ? device_pm_wait_for_dev+0x40/0x40

[ 172.456007] [<ffffffff81336d12>] device_for_each_child+0x52/0x70

[ 172.456007] [<ffffffff81342241>] __device_suspend+0x31/0x2e0

[ 172.456007] [<ffffffff81342550>] dpm_suspend+0x60/0x260

[ 172.456007] [<ffffffff810ad107>] hibernation_snapshot+0xd7/0x1b0

[ 172.456007] [<ffffffff810ad497>] hibernate+0x137/0x260

[ 172.456007] [<ffffffff810aaea4>] state_store+0x114/0x140

[ 172.456007] [<ffffffff8126c3a7>] kobj_attr_store+0x17/0x20

[ 172.456007] [<ffffffff811e9214>] sysfs_write_file+0xd4/0x150

[ 172.456007] [<ffffffff81173aeb>] vfs_write+0xcb/0x130

[ 172.456007] [<ffffffff81173c40>] sys_write+0x50/0x90

[ 172.456007] [<ffffffff8149bc39>] system_call_fastpath+0x16/0x1b

[ 172.456007] Code: 00 45 85 ff 48 0f 44 d8 4c 8b 23 49 39 dc 74 aa 4d 89 e5 49 81 ed 80 d6 e0 81 49 c1 fd 06 4c 3b 2d 05 83 d7 00 0f 83 f7 00 00 00 <49> 8b 54 24 10 8b 42 2c 3b 05 d1 01 43 01 0f 84 88 00 00 00 4c

[ 172.456007] RIP [<ffffffff81095309>] __bfs+0xe9/0x270

[ 172.456007] RSP <ffff8808c6607ab8>

[ 172.456007] CR2: 0000000000000010

[ 172.456007] ---[ end trace b286a5fb903f51e8 ]---

[ 172.456007] BUG: sleeping function called from invalid context at kernel/rwsem.c:21

[ 172.456007] in_atomic(): 1, irqs_disabled(): 1, pid: 6166, name: bash

[ 172.456007] INFO: lockdep is turned off.

[ 172.456007] irq event stamp: 4430551

[ 172.456007] hardirqs last enabled at (4430551): [<ffffffff8148f5ff>] __mutex_unlock_slowpath+0xef/0x180

[ 172.456007] hardirqs last disabled at (4430550): [<ffffffff8148f559>] __mutex_unlock_slowpath+0x49/0x180

[ 172.456007] softirqs last enabled at (4424302): [<ffffffff810451be>] __do_softirq+0x21e/0x3a0

[ 172.456007] softirqs last disabled at (4424285): [<ffffffff8149d16c>] call_softirq+0x1c/0x30

[ 172.456007] Pid: 6166, comm: bash Tainted: G D W 3.2.0-33-default #3

[ 172.456007] Call Trace:

[ 172.456007] [<ffffffff81072992>] __might_sleep+0x152/0x1f0

[ 172.456007] [<ffffffff8149013f>] down_read+0x1f/0x60

[ 172.456007] [<ffffffff810550ff>] exit_signals+0x1f/0x140

[ 172.456007] [<ffffffff8106c411>] ? blocking_notifier_call_chain+0x11/0x20

[ 172.456007] [<ffffffff81042742>] do_exit+0xb2/0x480

[ 172.456007] [<ffffffff81493db4>] oops_end+0xe4/0xf0

[ 172.456007] [<ffffffff8102f609>] no_context+0x119/0x200

[ 172.456007] [<ffffffff8102f83d>] __bad_area_nosemaphore+0x14d/0x220

[ 172.456007] [<ffffffff8102f91e>] bad_area_nosemaphore+0xe/0x10

[ 172.456007] [<ffffffff81496c0e>] do_page_fault+0x38e/0x4c0

[ 172.456007] [<ffffffff8109650d>] ? trace_hardirqs_off+0xd/0x10

[ 172.456007] [<ffffffff8103df92>] ? vprintk+0x362/0x530

[ 172.456007] [<ffffffff8127725d>] ? trace_hardirqs_off_thunk+0x3a/0x3c

[ 172.456007] [<ffffffff814931c5>] page_fault+0x25/0x30

[ 172.456007] [<ffffffff81095309>] ? __bfs+0xe9/0x270

[ 172.456007] [<ffffffff81095415>] ? __bfs+0x1f5/0x270

[ 172.456007] [<ffffffff81095100>] ? noop_count+0x10/0x10

[ 172.456007] [<ffffffff8109892b>] check_usage_backwards+0x6b/0xf0

[ 172.456007] [<ffffffff81496d66>] ? notifier_call_chain+0x26/0x110

[ 172.456007] [<ffffffff810988c0>] ? check_usage_forwards+0x110/0x110

[ 172.456007] [<ffffffff8109930a>] mark_lock_irq+0x9a/0x200

[ 172.456007] [<ffffffff8109957c>] mark_lock+0x10c/0x1b0

[ 172.456007] [<ffffffff81099740>] mark_irqflags+0x120/0x190

[ 172.456007] [<ffffffff8109ae60>] __lock_acquire+0x460/0x520

[ 172.456007] [<ffffffff8109b041>] lock_acquire+0x121/0x190

[ 172.456007] [<ffffffff81473fc0>] ? klist_next+0x30/0x110

[ 172.456007] [<ffffffff81099830>] ? mark_held_locks+0x80/0x120

[ 172.456007] [<ffffffff81341700>] ? device_pm_wait_for_dev+0x40/0x40

[ 172.456007] [<ffffffff814922cc>] _raw_spin_lock+0x3c/0x50

[ 172.456007] [<ffffffff81473fc0>] ? klist_next+0x30/0x110

[ 172.456007] [<ffffffff81473fc0>] klist_next+0x30/0x110

[ 172.456007] [<ffffffff81341700>] ? device_pm_wait_for_dev+0x40/0x40

[ 172.456007] [<ffffffff81336d12>] device_for_each_child+0x52/0x70

[ 172.456007] [<ffffffff81342241>] __device_suspend+0x31/0x2e0

[ 172.456007] [<ffffffff81342550>] dpm_suspend+0x60/0x260

[ 172.456007] [<ffffffff810ad107>] hibernation_snapshot+0xd7/0x1b0

[ 172.456007] [<ffffffff810ad497>] hibernate+0x137/0x260

[ 172.456007] [<ffffffff810aaea4>] state_store+0x114/0x140

[ 172.456007] [<ffffffff8126c3a7>] kobj_attr_store+0x17/0x20

[ 172.456007] [<ffffffff811e9214>] sysfs_write_file+0xd4/0x150

[ 172.456007] [<ffffffff81173aeb>] vfs_write+0xcb/0x130

[ 172.456007] [<ffffffff81173c40>] sys_write+0x50/0x90

[ 172.456007] [<ffffffff8149bc39>] system_call_fastpath+0x16/0x1b

[ 172.456007] note: bash[6166] exited with preempt_count 1

[ 173.455156] BUG: scheduling while atomic: bash/6166/0x10000002

[ 173.461156] INFO: lockdep is turned off.

[ 173.465253] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf microcode fuse loop dm_mod shpchp i7core_edac cdc_ether i2c_i801 ioatdma bnx2 usbnet iTCO_wdt edac_core pci_hotplug dca i2c_core tpm_tis pcspkr serio_raw tpm mii sg rtc_cmos button iTCO_vendor_support tpm_bios uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon

[ 173.513250] Pid: 6166, comm: bash Tainted: G D W 3.2.0-33-default #3

[ 173.523507] Call Trace:

[ 173.526124] [<ffffffff81073157>] __schedule_bug+0x97/0xa0

[ 173.531777] [<ffffffff814910b5>] __schedule+0x705/0x9a0

[ 173.537268] [<ffffffff8116a8c8>] ? mem_cgroup_update_page_stat+0x148/0x250

[ 173.544396] [<ffffffff8116a7d1>] ? mem_cgroup_update_page_stat+0x51/0x250

[ 173.551451] [<ffffffff81150eba>] ? alloc_pages_current+0xaa/0x120

[ 173.557801] [<ffffffff81492c76>] ? _raw_spin_unlock+0x26/0x40

[ 173.563806] [<ffffffff811336c4>] ? zap_pte_range+0x84/0x3b0

[ 173.569634] [<ffffffff811337f5>] ? zap_pte_range+0x1b5/0x3b0

[ 173.575547] [<ffffffff81496ef6>] ? __atomic_notifier_call_chain+0xa6/0x130

[ 173.582689] [<ffffffff81078af5>] __cond_resched+0x25/0x40

[ 173.588345] [<ffffffff814913dd>] _cond_resched+0x2d/0x40

[ 173.593914] [<ffffffff811342ce>] unmap_page_range+0x25e/0x300

[ 173.599921] [<ffffffff8113443c>] unmap_vmas+0xcc/0x150

[ 173.605319] [<ffffffff81139dbd>] exit_mmap+0x8d/0x120

[ 173.610633] [<ffffffff8103ac3c>] mmput+0x6c/0x150

[ 173.615595] [<ffffffff8103ffca>] exit_mm+0x10a/0x140

[ 173.620812] [<ffffffff81492bab>] ? _raw_spin_unlock_irq+0x2b/0x50

[ 173.627168] [<ffffffff8130f413>] ? tty_audit_exit+0x23/0xa0

[ 173.633001] [<ffffffff810427e3>] do_exit+0x153/0x480

[ 173.638224] [<ffffffff81493db4>] oops_end+0xe4/0xf0

[ 173.643358] [<ffffffff8102f609>] no_context+0x119/0x200

[ 173.648848] [<ffffffff8102f83d>] __bad_area_nosemaphore+0x14d/0x220

[ 173.655371] [<ffffffff8102f91e>] bad_area_nosemaphore+0xe/0x10

[ 173.661462] [<ffffffff81496c0e>] do_page_fault+0x38e/0x4c0

[ 173.667204] [<ffffffff8109650d>] ? trace_hardirqs_off+0xd/0x10

[ 173.673291] [<ffffffff8103df92>] ? vprintk+0x362/0x530

[ 173.678684] [<ffffffff8127725d>] ? trace_hardirqs_off_thunk+0x3a/0x3c

[ 173.685385] [<ffffffff814931c5>] page_fault+0x25/0x30

[ 173.690696] [<ffffffff81095309>] ? __bfs+0xe9/0x270

[ 173.695837] [<ffffffff81095415>] ? __bfs+0x1f5/0x270

[ 173.701060] [<ffffffff81095100>] ? noop_count+0x10/0x10

[ 173.706542] [<ffffffff8109892b>] check_usage_backwards+0x6b/0xf0

[ 173.712811] [<ffffffff81496d66>] ? notifier_call_chain+0x26/0x110

[ 173.719141] [<ffffffff810988c0>] ? check_usage_forwards+0x110/0x110

[ 173.725672] [<ffffffff8109930a>] mark_lock_irq+0x9a/0x200

[ 173.731324] [<ffffffff8109957c>] mark_lock+0x10c/0x1b0

[ 173.736723] [<ffffffff81099740>] mark_irqflags+0x120/0x190

[ 173.742466] [<ffffffff8109ae60>] __lock_acquire+0x460/0x520

[ 173.748297] [<ffffffff8109b041>] lock_acquire+0x121/0x190

[ 173.753956] [<ffffffff81473fc0>] ? klist_next+0x30/0x110

[ 173.759523] [<ffffffff81099830>] ? mark_held_locks+0x80/0x120

[ 173.765528] [<ffffffff81341700>] ? device_pm_wait_for_dev+0x40/0x40

[ 173.772079] [<ffffffff814922cc>] _raw_spin_lock+0x3c/0x50

[ 173.777748] [<ffffffff81473fc0>] ? klist_next+0x30/0x110

[ 173.783317] [<ffffffff81473fc0>] klist_next+0x30/0x110

[ 173.788738] [<ffffffff81341700>] ? device_pm_wait_for_dev+0x40/0x40

[ 173.795264] [<ffffffff81336d12>] device_for_each_child+0x52/0x70

[ 173.801534] [<ffffffff81342241>] __device_suspend+0x31/0x2e0

[ 173.807451] [<ffffffff81342550>] dpm_suspend+0x60/0x260

[ 173.812938] [<ffffffff810ad107>] hibernation_snapshot+0xd7/0x1b0

[ 173.819201] [<ffffffff810ad497>] hibernate+0x137/0x260

[ 173.824598] [<ffffffff810aaea4>] state_store+0x114/0x140

[ 173.830168] [<ffffffff8126c3a7>] kobj_attr_store+0x17/0x20

[ 173.835915] [<ffffffff811e9214>] sysfs_write_file+0xd4/0x150

[ 173.841833] [<ffffffff81173aeb>] vfs_write+0xcb/0x130

[ 173.847147] [<ffffffff81173c40>] sys_write+0x50/0x90

[ 173.852376] [<ffffffff8149bc39>] system_call_fastpath+0x16/0x1b

Regards,
Srivatsa S. Bhat
IBM Linux Technology Center

--

Linus Torvalds

unread,

Jan 13, 2012, 4:08:08 PM1/13/12

to Srivatsa S. Bhat, Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, Greg Kroah-Hartman, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Jeff Chua

On Fri, Jan 13, 2012 at 12:53 PM, Srivatsa S. Bhat
<srivat...@linux.vnet.ibm.com> wrote:
>
> Wait a minute, did you mention "second attempt"? I think I have something
> interesting..

Yes, I think you're hitting the exact same thing.

I *think* that what is going on is that we free some data structure
too early, and we didn't use to free them before. I tried to see if I
could catch it with slab and list debugging, but I didn't see
anything, and the machine I used for suspend/resume had other issues
too (wireless network - which is the *only* network on that machine -
hung on resume), so I ended up punting and just disabling MCE to
concentrate on those issues.

On eof the differences between sysdev and 'struct device' is that
sysdev doesn't bother refcounting parents etc. So there could have
been some refcount problem that was never relevant with the old sysdev
code. I dunno.

The wireless issues got resolved for me, and I haven't gotten back to
MCE yet. I was *really* hoping that somebody else could figure it out,
since I'm not the only one seeing it..

Linus

Andi Kleen

unread,

Jan 13, 2012, 4:24:13 PM1/13/12

to Linus Torvalds, Srivatsa S. Bhat, Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, linux-...@vger.kernel.org, Greg Kroah-Hartman, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Jeff Chua

> Yes, I think you're hitting the exact same thing.
>
> I *think* that what is going on is that we free some data structure
> too early, and we didn't use to free them before.

Assuming the problem is with the multiple devices (not fully sure):

The redundant devices don't make much sense anyways, they are all
mirrors of each other
(state changes always affect all CPUs) Could just collapse down to a
single device and symlink
the others.

-Andi

Justin P. Mattock

unread,

Jan 13, 2012, 4:38:44 PM1/13/12

to Andi Kleen, Linus Torvalds, Srivatsa S. Bhat, Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, linux-...@vger.kernel.org, Greg Kroah-Hartman, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, Jeff Chua

On 01/13/2012 01:24 PM, Andi Kleen wrote:
>
>> Yes, I think you're hitting the exact same thing.
>>
>> I *think* that what is going on is that we free some data structure
>> too early, and we didn't use to free them before.
>
> Assuming the problem is with the multiple devices (not fully sure):
>
> The redundant devices don't make much sense anyways, they are all
> mirrors of each other
> (state changes always affect all CPUs) Could just collapse down to a
> single device and symlink
> the others.
>
> -Andi
>
>

this showed up using no_console_suspend

131.875143] usb 5-1: device descriptor read/64, error -110
[ 140.599340] PM: Syncing filesystems ... done.
[ 140.815981] PM: Preparing system for mem sleep
[ 140.829117] Freezing user space processes ... (elapsed 0.01 seconds)
done.
[ 140.840150] Freezing remaining freezable tasks ...
[ 147.079160] usb 5-1: device descriptor read/64, error -110
[ 147.282166] usb 5-1: new full-speed USB device number 6 using uhci_hcd
[ 157.686165] usb 5-1: device not accepting address 6, error -110
[ 157.788183] usb 5-1: new full-speed USB device number 7 using uhci_hcd
[ 160.849310]
[ 160.849320] Freezing of tasks failed after 20.00 seconds (1 tasks
refusing to freeze, wq_busy=0):
[ 160.849460] khubd D f5d90020 0 20 2 0x00000000
[ 160.849471] f5d95d50 00000046 f5d095d0 f5d90020 00000000 c16ec3c0
bce8e78a 00000024
[ 160.849488] c16ec3c0 bce7b7f6 00000024 f60063c0 f5d09170 f5d95d20
c120490b c1039aa6
[ 160.849505] 00000000 00000046 c1721180 00000296 f5d95d70 f5d95d40
c1465208 00000000
[ 160.849521] Call Trace:
[ 160.849538] [<c120490b>] ? do_raw_spin_lock+0x3b/0xf0
[ 160.849548] [<c1039aa6>] ? lock_timer_base.isra.24+0x26/0x50
[ 160.849558] [<c1465208>] ? _raw_spin_lock_irqsave+0x58/0x70
[ 160.849567] [<c1204a4e>] ? do_raw_spin_unlock+0x4e/0x90
[ 160.849574] [<c1463c30>] schedule+0x30/0x50
[ 160.849582] [<c1461b7f>] schedule_timeout+0x10f/0x1f0
[ 160.849589] [<c10396f0>] ? usleep_range+0x40/0x40
[ 160.849597] [<c1463ae0>] wait_for_common+0xb0/0x120
[ 160.849605] [<c1053bb0>] ? try_to_wake_up+0x260/0x260
[ 160.849614] [<c1463bdd>] wait_for_completion_timeout+0xd/0x10
[ 160.849624] [<c13250e1>] usb_start_wait_urb+0xb1/0xe0
[ 160.849632] [<c10e0101>] ? sys_swapon+0xab1/0xc50
[ 160.849640] [<c1325308>] usb_control_msg+0xb8/0xf0
[ 160.849648] [<c12ad1e8>] ? _dev_info+0x28/0x30
[ 160.849656] [<c131e627>] hub_port_init+0x627/0x710
[ 160.849664] [<c131d396>] ? usb_set_device_state+0x76/0x130
[ 160.849672] [<c1320906>] hub_thread+0x626/0x1080
[ 160.849681] [<c10515a1>] ? finish_task_switch+0x31/0xf0
[ 160.849688] [<c14635c0>] ? __schedule+0x3b0/0x7b0
[ 160.849698] [<c10490c0>] ? __init_waitqueue_head+0x50/0x50
[ 160.849705] [<c1050ef9>] ? complete+0x49/0x60
[ 160.849713] [<c13202e0>] ? usb_remote_wakeup+0x40/0x40
[ 160.849720] [<c1048928>] kthread+0x78/0x80
[ 160.849728] [<c10488b0>] ? __init_kthread_worker+0x60/0x60
[ 160.849736] [<c146b0fe>] kernel_thread_helper+0x6/0xd
[ 160.849755]
[ 160.849759] Restarting tasks ... done.
[ 160.865733] power_supply BAT0: uevent
[ 160.865737] power_supply BAT0: POWER_SUPPLY_NAME=BAT0
[ 160.886551] power_supply BAT0: prop STATUS=Full
[ 160.886562] power_supply BAT0: prop PRESENT=1
[ 160.886570] power_supply BAT0: prop TECHNOLOGY=Unknown
[ 160.886577] power_supply BAT0: prop CYCLE_COUNT=0

I can supply full dmesg if needed.
a bisect on this should not take too long, just need the time to do so.

last good kernel I have here is: 3.2.0-06541-gf33180c

Justin P. Mattock

Srivatsa S. Bhat

unread,

Jan 13, 2012, 5:06:58 PM1/13/12

to Justin P. Mattock, Andi Kleen, Linus Torvalds, Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, linux-...@vger.kernel.org, Greg Kroah-Hartman, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, Jeff Chua

Freezing failure is a totally different problem. Freezing happens much
before CPUs are taken offline and even before devices are suspended.
But yes, if freezing fails, suspend fails too (it is aborted rather).
And freezing failures are typically a bit harder to trigger since they
occur due to some race conditions. But the suspend failure problem
discussed earlier in this thread (while discussing the MCE warnings) is a
deterministic thing and very easily reproducible.

Regards,
Srivatsa S. Bhat
IBM Linux Technology Center

--

Alan Stern

unread,

Jan 13, 2012, 5:17:15 PM1/13/12

to Srivatsa S. Bhat, Justin P. Mattock, Andi Kleen, Linus Torvalds, Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, linux-...@vger.kernel.org, Greg Kroah-Hartman, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, Jeff Chua

The freezing failure is easy to debug. The khubd thread was busy
trying to initialize a non-working USB device. It doesn't check for
freezes while doing this, and it has a lot of (probably too many)
nested retry loops with long delays.

If the non-working USB device were unplugged, the problem would go
away.

Alan Stern

Srivatsa S. Bhat

unread,

Jan 13, 2012, 5:18:27 PM1/13/12

to Justin P. Mattock, Andi Kleen, Linus Torvalds, Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, linux-...@vger.kernel.org, Greg Kroah-Hartman, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, Jeff Chua, Tejun Heo, Alan Stern

So, looks like we have got 2 problems: a)freezing failure apparently due to
usb related code b) suspend failure due to MCE overhaul.

Adding Tejun and Alan Stern to Cc.

Regards,
Srivatsa S. Bhat

Linus Torvalds

unread,

Jan 13, 2012, 6:02:01 PM1/13/12

to Srivatsa S. Bhat, Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, Greg Kroah-Hartman, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Jeff Chua

On Fri, Jan 13, 2012 at 12:22 PM, Srivatsa S. Bhat
<srivat...@linux.vnet.ibm.com> wrote:
>
> Fundamentally, this warning is triggered during CPU Offline, which is done
> during poweroff, suspend, hibernate etc. IOW, even a simple
> # echo 0 > /sys/devices/system/cpu/cpuX/online will trigger it.

There is definitely something wrong with CPU hotplug and MCE.

I seem to be able to trigger not only warnings, but some oopses, by doing:

- enable list debugging, slab debugging, and kobject debugging in the
kernel (I've got some other things enabled too, but I think those are
the main ones)

- do

echo 0 > /sys/devices/system/cpu/cpuX/online

this gets a few warnings

- then do

echo 1 > /sys/devices/system/cpu/cpuX/online

where bringing it up again will crash the machine entirely.

The oops scrolled off the screen an ddidn't get caught anywhere, but
the call trace seems to be (warning: hand-entered, so some of this may
be bogus):

Oops in:
kobject_get+0x10/0x40

Code:
55 48 89 f8 48 89 e5 48 83 ec 10 48 85 ff 74 0b <8b> 57 38 85 d2 74 06 f0 ff

Call trace:
get_device
klist_device_get
klist_add_tail
bus_add_device
device_add
device_register
mce_device_create
notifier_call_chain
__raw_notifier_call_chain
__cpu_notify
_cpu_up
store_online
dev_attr_change
sysfs_write_file

so it's definitely something bad in MCE device handling, and probably
something to do with reusing a 'struct device' after freeign it, or
after not having completely cleaned it up.

I didn't see if I could spot the problem, but I think this is entirely
reproducible, so hopefully somebody who knows the MCE code can
trivially see this and fix it.

Linus

Andi Kleen

unread,

Jan 13, 2012, 6:13:52 PM1/13/12

to Linus Torvalds, Srivatsa S. Bhat, Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, linux-...@vger.kernel.org, Greg Kroah-Hartman, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Jeff Chua

> Assuming the problem is with the multiple devices (not fully sure):
>
> The redundant devices don't make much sense anyways, they are all
> mirrors of each other
> (state changes always affect all CPUs) Could just collapse down to a
> single device and symlink
> the others.

Checking again it seems like the AMD threshold code reports per CPU
counters so it would
still need per CPU devices.

All the other code is just a single instance I believe and doesn't
actually need all the
complicated per CPU code.

Srivatsa S. Bhat

unread,

Jan 13, 2012, 6:27:51 PM1/13/12

to Linus Torvalds, Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, Greg Kroah-Hartman, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Jeff Chua, Suresh B Siddha, Peter Zijlstra, Mel Gorman, Gilad Ben-Yossef

On 01/14/2012 04:32 AM, Linus Torvalds wrote:

> On Fri, Jan 13, 2012 at 12:22 PM, Srivatsa S. Bhat
> <srivat...@linux.vnet.ibm.com> wrote:
>>
>> Fundamentally, this warning is triggered during CPU Offline, which is done
>> during poweroff, suspend, hibernate etc. IOW, even a simple
>> # echo 0 > /sys/devices/system/cpu/cpuX/online will trigger it.
>
> There is definitely something wrong with CPU hotplug and MCE.
>
> I seem to be able to trigger not only warnings, but some oopses, by doing:
>
> - enable list debugging, slab debugging, and kobject debugging in the
> kernel (I've got some other things enabled too, but I think those are
> the main ones)
>
> - do
>
> echo 0 > /sys/devices/system/cpu/cpuX/online
>
> this gets a few warnings
>
> - then do
>
> echo 1 > /sys/devices/system/cpu/cpuX/online
>
> where bringing it up again will crash the machine entirely.
>

I observed this too; and it is very easy to reproduce.
Here is the log:

# echo 0 > /sys/devices/system/cpu/cpu1/online

[ 65.091045] CPU 1 is now offline
[ 65.097267] ------------[ cut here ]------------
[ 65.102045] WARNING: at drivers/base/core.c:194 device_release+0x82/0x90()
[ 65.109137] Hardware name: IBM System x -[7870C4Q]-
[ 65.109139] Device 'machinecheck1' does not have a release() function, it is broken and must be fixed.
[ 65.109141] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf microcode fuse loop dm_mod cdc_ether usbnet i7core_edac edac_core mii serio_raw i2c_i801 shpchp ioatdma iTCO_wdt iTCO_vendor_support dca pci_hotplug pcspkr bnx2 i2c_core tpm_tis tpm tpm_bios sg rtc_cmos button uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
[ 65.109195] Pid: 6631, comm: bash Not tainted 3.2.0-debugkernel-0.0.0.28.36b5ec9-default #4
[ 65.109197] Call Trace:
[ 65.109202] [<ffffffff8133b462>] ? device_release+0x82/0x90
[ 65.109208] [<ffffffff8103cc2a>] warn_slowpath_common+0x7a/0xb0
[ 65.109212] [<ffffffff8103cd01>] warn_slowpath_fmt+0x41/0x50
[ 65.109216] [<ffffffff8133b462>] device_release+0x82/0x90
[ 65.109223] [<ffffffff8127051e>] ? kobj_kset_leave+0x1e/0x60
[ 65.109228] [<ffffffff8127060d>] kobject_cleanup+0x6d/0x1b0
[ 65.109233] [<ffffffff8127075d>] kobject_release+0xd/0x10
[ 65.109237] [<ffffffff812704ab>] kobject_put+0x2b/0x60
[ 65.109241] [<ffffffff8133ab42>] put_device+0x12/0x20
[ 65.109245] [<ffffffff8133bfc5>] device_unregister+0x25/0x60
[ 65.109252] [<ffffffff8148a22f>] mce_cpu_callback+0x149/0x1a5
[ 65.109257] [<ffffffff8149b4a2>] notifier_call_chain+0x72/0x110
[ 65.109263] [<ffffffff8106bf19>] __raw_notifier_call_chain+0x9/0x10
[ 65.109270] [<ffffffff8147b9b6>] _cpu_down+0x1c6/0x320
[ 65.109274] [<ffffffff8147bb4b>] cpu_down+0x3b/0x60
[ 65.109279] [<ffffffff8147db1d>] store_online+0x6d/0xc8
[ 65.109283] [<ffffffff8133a70b>] dev_attr_store+0x1b/0x20
[ 65.109288] [<ffffffff811ecb04>] sysfs_write_file+0xd4/0x150
[ 65.109295] [<ffffffff81176d1b>] vfs_write+0xcb/0x130
[ 65.109299] [<ffffffff81176e70>] sys_write+0x50/0x90
[ 65.109304] [<ffffffff814a0379>] system_call_fastpath+0x16/0x1b
[ 65.109307] ---[ end trace dafb3fda8041063e ]---
[ 65.112016] ------------[ cut here ]------------
[ 65.112024] WARNING: at arch/x86/kernel/smp.c:120 native_smp_send_reschedule+0x59/0x60()
[ 65.112027] Hardware name: IBM System x -[7870C4Q]-
[ 65.112028] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf microcode fuse loop dm_mod cdc_ether usbnet i7core_edac edac_core mii serio_raw i2c_i801 shpchp ioatdma iTCO_wdt iTCO_vendor_support dca pci_hotplug pcspkr bnx2 i2c_core tpm_tis tpm tpm_bios sg rtc_cmos button uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
[ 65.112067] Pid: 2277, comm: udevd Tainted: G W 3.2.0-debugkernel-0.0.0.28.36b5ec9-default #4
[ 65.112070] Call Trace:
[ 65.112071] <IRQ> [<ffffffff81021349>] ? native_smp_send_reschedule+0x59/0x60
[ 65.112079] [<ffffffff8103cc2a>] warn_slowpath_common+0x7a/0xb0
[ 65.112083] [<ffffffff8103cc75>] warn_slowpath_null+0x15/0x20
[ 65.112086] [<ffffffff81021349>] native_smp_send_reschedule+0x59/0x60
[ 65.112092] [<ffffffff810825f5>] trigger_load_balance+0x185/0x4f0
[ 65.112096] [<ffffffff8108262b>] ? trigger_load_balance+0x1bb/0x4f0
[ 65.112101] [<ffffffff81073617>] scheduler_tick+0x107/0x170
[ 65.112107] [<ffffffff8104e057>] update_process_times+0x67/0x80
[ 65.112113] [<ffffffff8109353f>] tick_sched_timer+0x5f/0xc0
[ 65.112117] [<ffffffff810934e0>] ? tick_nohz_handler+0x100/0x100
[ 65.112122] [<ffffffff8106a05e>] __run_hrtimer+0x12e/0x330
[ 65.112126] [<ffffffff8106a4a7>] hrtimer_interrupt+0xc7/0x1f0
[ 65.112131] [<ffffffff81022f64>] smp_apic_timer_interrupt+0x64/0xa0
[ 65.112135] [<ffffffff814a0eb3>] apic_timer_interrupt+0x73/0x80
[ 65.112137] <EOI> [<ffffffff8115f788>] ? __slab_alloc+0x228/0x4e0
[ 65.112145] [<ffffffff810654f0>] ? __wake_up_bit+0x10/0x30
[ 65.112150] [<ffffffff8110b7e5>] unlock_page+0x25/0x30
[ 65.112157] [<ffffffff81135f75>] do_wp_page+0x4f5/0x7b0
[ 65.112161] [<ffffffff8113708d>] handle_pte_fault+0x19d/0x1e0
[ 65.112165] [<ffffffff81137248>] handle_mm_fault+0x178/0x2e0
[ 65.112169] [<ffffffff8149b171>] do_page_fault+0x201/0x4c0
[ 65.112173] [<ffffffff8103c109>] ? do_fork+0x179/0x350
[ 65.112177] [<ffffffff8119900e>] ? mntput+0x1e/0x30
[ 65.112182] [<ffffffff811786ef>] ? __fput+0x16f/0x210
[ 65.112187] [<ffffffff8127ae3d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 65.112192] [<ffffffff81497905>] page_fault+0x25/0x30
[ 65.112195] ---[ end trace dafb3fda8041063f ]---
[ 65.541793] CPU 9 MCA banks CMCI:2 CMCI:3 CMCI:5
[ 75.472229] lockdep: fixing up alternatives.

The above warning is related to the reschedule IPI sent to an offline cpu.
I guess this is due to the recent changes done to nohz_balancer_kick() and
find_new_ilb() in kernel/sched/fair.c. I had never seen this warning before
3.3 merge window, even during CPU Hotplug stress tests. Now this warning
is seen pretty often during CPU offline.

[Adding Suresh Siddha and Peter Zijlstra to Cc.]

# echo 1 > /sys/devices/system/cpu/cpu1/online

[ 75.476772] Booting Node 0 Processor 1 APIC 0x2
[ 75.481495] smpboot cpu 1: start_ip = 97000
[ 75.492927] Calibrating delay loop (skipped) already calibrated this CPU
[ 75.508449] NMI watchdog enabled, takes one hw-pmu counter.
[ 75.515402] general protection fault: 0000 [#1] SMP
[ 75.518940] CPU 7
[ 75.518940] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf microcode fuse loop dm_mod cdc_ether usbnet i7core_edac edac_core mii serio_raw i2c_i801 shpchp ioatdma iTCO_wdt iTCO_vendor_support dca pci_hotplug pcspkr bnx2 i2c_core tpm_tis tpm tpm_bios sg rtc_cmos button uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
[ 75.518940]
[ 75.518940] Pid: 6631, comm: bash Tainted: G W 3.2.0-debugkernel-0.0.0.28.36b5ec9-default #4 IBM IBM System x -[7870C4Q]-/68Y8033
[ 75.518940] RIP: 0010:[<ffffffff81270779>] [<ffffffff81270779>] kobject_get+0x19/0x60
[ 75.518940] RSP: 0018:ffff8808c6cc7c18 EFLAGS: 00010206
[ 75.518940] RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b7b RCX: 0000000000000006
[ 75.518940] RDX: ffffffff81e98ae0 RSI: ffff8808ccc93080 RDI: 6b6b6b6b6b6b6b7b
[ 75.518940] RBP: ffff8808c6cc7c28 R08: 5ff145670d8e439e R09: 0000000000000000
[ 75.518940] R10: 0000000000000005 R11: 0000000000000001 R12: ffff88114ded3608
[ 75.518940] R13: ffffffff81a13440 R14: ffff8808ddc4cb60 R15: 0000000000000001
[ 75.518940] FS: 00007f9a3218e700(0000) GS:ffff88117fcc0000(0000) knlGS:0000000000000000
[ 75.518940] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 75.518940] CR2: 000000000068a2a0 CR3: 000000114bd59000 CR4: 00000000000006e0
[ 75.518940] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 75.518940] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 75.518940] Process bash (pid: 6631, threadinfo ffff8808c6cc6000, task ffff8808c6d9c600)
[ 75.518940] Stack:
[ 75.518940] ffff8808ccc93080 ffff88114ded3608 ffff8808c6cc7c38 ffffffff8133ab14
[ 75.518940] ffff8808c6cc7c48 ffffffff8133ddad ffff8808c6cc7c68 ffffffff81478b82
[ 75.518940] ffff88114ded3608 ffff8808ccc93080 ffff8808c6cc7c88 ffffffff81479062
[ 75.518940] Call Trace:
[ 75.518940] [<ffffffff8133ab14>] get_device+0x14/0x20
[ 75.518940] [<ffffffff8133ddad>] klist_devices_get+0xd/0x10
[ 75.518940] [<ffffffff81478b82>] klist_node_init+0x42/0x70
[ 75.518940] [<ffffffff81479062>] klist_add_tail+0x22/0x60
[ 75.518940] [<ffffffff8133e76b>] bus_add_device+0x1bb/0x200
[ 75.518940] [<ffffffff8133c7c7>] device_add+0x2e7/0x570
[ 75.518940] [<ffffffff813479e0>] ? device_pm_init+0x70/0xa0
[ 75.518940] [<ffffffff8133ca69>] device_register+0x19/0x20
[ 75.518940] [<ffffffff81489fe6>] mce_device_create+0x8b/0x18b
[ 75.518940] [<ffffffff8148a26d>] mce_cpu_callback+0x187/0x1a5
[ 75.518940] [<ffffffff8149b4a2>] notifier_call_chain+0x72/0x110
[ 75.518940] [<ffffffff8106bf19>] __raw_notifier_call_chain+0x9/0x10
[ 75.518940] [<ffffffff8148db41>] _cpu_up+0x124/0x12a
[ 75.518940] [<ffffffff8148dc03>] cpu_up+0xbc/0x114
[ 75.518940] [<ffffffff8147db45>] store_online+0x95/0xc8
[ 75.518940] [<ffffffff8133a70b>] dev_attr_store+0x1b/0x20
[ 75.518940] [<ffffffff811ecb04>] sysfs_write_file+0xd4/0x150
[ 75.518940] [<ffffffff81176d1b>] vfs_write+0xcb/0x130
[ 75.518940] [<ffffffff81176e70>] sys_write+0x50/0x90
[ 75.518940] [<ffffffff814a0379>] system_call_fastpath+0x16/0x1b
[ 75.518940] Code: ff ff 55 48 83 ef 38 48 89 e5 e8 43 fe ff ff c9 c3 90 55 48 89 e5 48 83 ec 10 48 85 ff 48 89 1c 24 4c 89 64 24 08 48 89 fb 74 0f <8b> 47 38 4c 8d 67 38 85 c0 74 1c f0 ff 43 38 48 89 d8 4c 8b 64
[ 75.518940] RIP [<ffffffff81270779>] kobject_get+0x19/0x60
[ 75.518940] RSP <ffff8808c6cc7c18>
[ 75.856395] ---[ end trace dafb3fda80410640 ]---

And in a separate try, I got this during cpu online operation:
(Pretty much the same as above, but with the BUG description present.)

[ 83.491328] Booting Node 1 Processor 6 APIC 0x14^M
[ 83.496135] smpboot cpu 6: start_ip = 97000^M
[ 72.494772] Calibrating delay loop (skipped) already calibrated this CPU^M
[ 83.522491] NMI watchdog enabled, takes one hw-pmu counter.^M
[ 83.529016] BUG: unable to handle kernel paging request at 000000350000004a^M
[ 83.532868] IP: [<ffffffff8126cac9>] kobject_get+0x19/0x60^M
[ 83.532868] PGD 8c7909067 PUD 0 ^M
[ 83.532868] Oops: 0000 [#1] SMP ^M
[ 83.532868] CPU 0 ^M
[ 83.532868] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf microcode fuse loop dm_mod ioatdma cdc_ether usbnet bnx2 shpchp mii tpm_tis tpm i7core_edac rtc_cmos serio_raw i2c_i801 dca pcspkr pci_hotplug edac_core i2c_core iTCO_wdt iTCO_vendor_support sg tpm_bios button uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon^M
[ 83.532868] ^M
[ 83.532868] Pid: 6347, comm: allon_cpu_statu Tainted: G W 3.2.0-33-default #3 IBM IBM System x -[7870C4Q]-/68Y8033 ^M
[ 83.532868] RIP: 0010:[<ffffffff8126cac9>] [<ffffffff8126cac9>] kobject_get+0x19/0x60^M
[ 83.532868] RSP: 0018:ffff8808c78c1c18 EFLAGS: 00010206^M
[ 83.532868] RAX: 0000000000000000 RBX: 0000003500000012 RCX: 0000000000000006^M
[ 83.532868] RDX: ffffffff81f0f180 RSI: ffff8808c7f01118 RDI: 0000003500000012^M
[ 83.532868] RBP: ffff8808c78c1c28 R08: 543148780dbe0391 R09: 0000000000000000^M
[ 83.532868] R10: 0000000000000005 R11: 0000000000000001 R12: ffff8808c9f37d38^M
[ 83.532868] R13: ffffffff81a13440 R14: ffff88117fc8cb60 R15: 0000000000000006^M
[ 83.532868] FS: 00007f7043861700(0000) GS:ffff8808ffc00000(0000) knlGS:0000000000000000^M
[ 83.532868] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b^M
[ 83.532868] CR2: 000000350000004a CR3: 00000008c7ee9000 CR4: 00000000000006f0^M
[ 83.532868] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000^M
[ 83.532868] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400^M
[ 83.532868] Process allon_cpu_statu (pid: 6347, threadinfo ffff8808c78c0000, task ffff8808ca7c8bc0)^M
[ 83.532868] Stack:^M
[ 83.532868] ffff8808c7f01118 ffff8808c9f37d38 ffff8808c78c1c38 ffffffff813362e4^M
[ 83.532868] ffff8808c78c1c48 ffffffff8133951d ffff8808c78c1c68 ffffffff81473db2^M
[ 83.532868] ffff8808c9f37d38 ffff8808c7f01118 ffff8808c78c1c88 ffffffff81474292^M
[ 83.532868] Call Trace:^M
[ 83.532868] [<ffffffff813362e4>] get_device+0x14/0x20^M
[ 83.532868] [<ffffffff8133951d>] klist_devices_get+0xd/0x10^M
[ 83.532868] [<ffffffff81473db2>] klist_node_init+0x42/0x70^M
[ 83.532868] [<ffffffff81474292>] klist_add_tail+0x22/0x60^M
[ 83.532868] [<ffffffff81339edb>] bus_add_device+0x1bb/0x200^M
[ 83.532868] [<ffffffff81337f77>] device_add+0x2e7/0x570^M
[ 83.532868] [<ffffffff81343080>] ? device_pm_init+0x70/0xa0^M
[ 83.532868] [<ffffffff81338219>] device_register+0x19/0x20^M
[ 83.532868] [<ffffffff8148537f>] mce_device_create+0x8b/0x18b^M
[ 83.532868] [<ffffffff81485606>] mce_cpu_callback+0x187/0x1a5^M
[ 83.532868] [<ffffffff81496db2>] notifier_call_chain+0x72/0x110^M
[ 83.532868] [<ffffffff8106c1c9>] __raw_notifier_call_chain+0x9/0x10^M
[ 83.532868] [<ffffffff81488dc1>] _cpu_up+0x124/0x12a^M
[ 83.532868] [<ffffffff81488e83>] cpu_up+0xbc/0x114^M
[ 83.532868] [<ffffffff81479065>] store_online+0x95/0xc8^M
[ 83.532868] [<ffffffff81335edb>] dev_attr_store+0x1b/0x20^M
[ 83.532868] [<ffffffff811e9214>] sysfs_write_file+0xd4/0x150^M
[ 83.532868] [<ffffffff81173aeb>] vfs_write+0xcb/0x130^M
[ 83.532868] [<ffffffff81173c40>] sys_write+0x50/0x90^M
[ 83.532868] [<ffffffff8149bc39>] system_call_fastpath+0x16/0x1b^M
[ 83.532868] Code: ff ff 55 48 83 ef 38 48 89 e5 e8 43 fe ff ff c9 c3 90 55 48 89 e5 48 83 ec 10 48 85 ff 48 89 1c 24 4c 89 64 24 08 48 89 fb 74 0f <8b> 47 38 4c 8d 67 38 85 c0 74 1c f0 ff 43 38 48 89 d8 4c 8b 64 ^M
[ 83.532868] RIP [<ffffffff8126cac9>] kobject_get+0x19/0x60^M
[ 83.532868] RSP <ffff8808c78c1c18>^M
[ 83.532868] CR2: 000000350000004a^M
[ 83.890209] ---[ end trace fab5021066ee998d ]---^M

> so it's definitely something bad in MCE device handling, and probably
> something to do with reusing a 'struct device' after freeign it, or
> after not having completely cleaned it up.
>
> I didn't see if I could spot the problem, but I think this is entirely
> reproducible, so hopefully somebody who knows the MCE code can
> trivially see this and fix it.
>
> Linus
>

Regards,
Srivatsa S. Bhat
IBM Linux Technology Center

--

Linus Torvalds

unread,

Jan 13, 2012, 7:05:42 PM1/13/12

to Srivatsa S. Bhat, Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, Greg Kroah-Hartman, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Jeff Chua, Suresh B Siddha, Peter Zijlstra, Mel Gorman, Gilad Ben-Yossef

On Fri, Jan 13, 2012 at 3:27 PM, Srivatsa S. Bhat
<srivat...@linux.vnet.ibm.com> wrote:
>
> # echo 1 > /sys/devices/system/cpu/cpu1/online
>
> [ � 75.476772] Booting Node 0 Processor 1 APIC 0x2
> [ � 75.481495] smpboot cpu 1: start_ip = 97000
> [ � 75.492927] Calibrating delay loop (skipped) already calibrated this CPU
> [ � 75.508449] NMI watchdog enabled, takes one hw-pmu counter.
> [ � 75.515402] general protection fault: 0000 [#1] SMP
> [ � 75.518940]

> [ � 75.518940] Pid: 6631, comm: bash Tainted: G � � � �W � �3.2.0-debugkernel-0.0.0.28.36b5ec9-default #4 IBM IBM System x -[7870C4Q]-/68Y8033

> [ � 75.518940] RIP: 0010:[<ffffffff81270779>] �[<ffffffff81270779>] kobject_get+0x19/0x60
> [ � 75.518940] RSP: 0018:ffff8808c6cc7c18 �EFLAGS: 00010206
> [ � 75.518940] RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b7b RCX: 0000000000000006
> [ � 75.518940] RDX: ffffffff81e98ae0 RSI: ffff8808ccc93080 RDI: 6b6b6b6b6b6b6b7b

The magic is the %rdi value. The instruction that oopses is

mov 0x38(%rdi),%eax

and "rdi" is 0x10 + the magic 6b6b6b.. pattern. Which is obviously
'poison_free'.

And the 0x10 is because get_device() does

return dev ? to_dev(kobject_get(&dev->kobj)) : NULL;

and I bet "kobj" is at offset 16 in the device structure. So we had a
pointer to a "struct device", but it was loaded from memory that was
free'd, turning the kobject pointer into that 0x6b6b6b6b6b6b6b7b

So somebody got a pointer from free'd memory. That somebody seems to
be 'klist_devices_get()' that got it from a 'struct klist_node', so I
think we have free'd something from the klist_devices list in the bus.
But I dunno. Odd. I would have expected us to hit that invalid pointer
long before if the klist entry was bogus.

I'm not seeign anything obvious in mce.c. But the fact that it's that
magic per_cpu allocation makes me nervous. It uses that magic
"mce_device_initialized" bit array etc, and ti clearly must have
worked before, but it equally clearly does *not* work now.

Looking more at it, I think that maybe something keeps the mce_device
around (refcounts that didn't use to exist before?) so when we
unregister it, it is still in use. And then when we re-register it
when we bring it up, we do that

memset(&dev->kobj, 0, sizeof(struct kobject));

on the device that is in use. I dunno. It's all scary. Somebody who
knows the MCE layer should look at it.

Linus

Dirk Gouders

unread,

Jan 13, 2012, 7:44:59 PM1/13/12

to Linus Torvalds, Srivatsa S. Bhat, Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, Greg Kroah-Hartman, Kay Sievers, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Jeff Chua

Linus Torvalds <torv...@linux-foundation.org> writes:

> On Fri, Jan 13, 2012 at 12:22 PM, Srivatsa S. Bhat
> <srivat...@linux.vnet.ibm.com> wrote:
>> However, it is to be noted that, technically this warning (machinecheck1
>> not having a release() function) is not all that new. Just that people
>> didn't probably notice it earlier (reason explained below).
>
> .. but there really is some semantic issue going on here. I'm not the
> only one for whom suspend/resume simply does not work with MCE
> enabled, and that "not work" coincides with the switch-over from
> sysdev to struct device.
>
> I suspect it is some subtle difference in sysdev_unregister() vs
> device_unregister(). The message change is just a symptom of the fact
> that we go through different paths.
>
> For me MCE hangs only on the *second* suspend, but it sounds like
> others see it on the first one.

Yes, my laptop hangs on the first suspend.
Another effect of the problem is that the laptop's clock is extremely
off after a forced reboot.

Dirk

Srivatsa S. Bhat

unread,

Jan 13, 2012, 9:41:31 PM1/13/12

to Linus Torvalds, Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, Greg Kroah-Hartman, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Jeff Chua, Suresh B Siddha, Peter Zijlstra, Mel Gorman, Gilad Ben-Yossef

YES!! Finally I have a fix for this whole MCE thing! :-)

The patch below works perfectly for me - I tested multiple CPU hotplug
operations as well as multiple pm_test runs at core level. Please let me
know if this solves the suspend issue as well..

Of course, the warnings at drivers/base/core.c: device_release()
as well as the IPI to offline cpu warnings still appear but are rather
unrelated and harmless to the issue being discussed.
So, with this patch CPU hotplug doesn't crash the system and suspend and
hibernate are expected to work fine.

-------
From: Srivatsa S. Bhat <srivat...@linux.vnet.ibm.com>
Subject: [PATCH] x86/mce: Fix CPU hotplug and suspend regression related to MCE

Commit 8a25a2f (cpu: convert 'cpu' and 'machinecheck' sysdev_class
to a regular subsystem) changed how things are dealt with in
the MCE subsystem. Some of the things that got broken due to this
are CPU hotplug and suspend/hibernate.

MCE uses per_cpu allocations of struct device. So, when a CPU goes
offline and comes back online, in order to ensure that we start
from a clean slate with respect to the MCE subsystem, zero out the
entire per_cpu device structure to 0 before using it.

Signed-off-by: Srivatsa S. Bhat <srivat...@linux.vnet.ibm.com>
---

arch/x86/kernel/cpu/mcheck/mce.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index f22a9f7..29ba329 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -2011,7 +2011,7 @@ static __cpuinit int mce_device_create(unsigned int cpu)
if (!mce_available(&boot_cpu_data))
return -EIO;

- memset(&dev->kobj, 0, sizeof(struct kobject));
+ memset(dev, 0, sizeof(struct device));
dev->id = cpu;
dev->bus = &mce_subsys;

Linus Torvalds

unread,

Jan 13, 2012, 9:53:04 PM1/13/12

to Srivatsa S. Bhat, Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, Greg Kroah-Hartman, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Jeff Chua, Suresh B Siddha, Peter Zijlstra, Mel Gorman, Gilad Ben-Yossef

On Fri, Jan 13, 2012 at 6:41 PM, Srivatsa S. Bhat
<srivat...@linux.vnet.ibm.com> wrote:
>
> YES!! Finally I have a fix for this whole MCE thing! :-)

Goodie.

> The patch below works perfectly for me - I tested multiple CPU hotplug
> operations as well as multiple pm_test runs at core level. Please let me
> know if this solves the suspend issue as well..

Ok, I'll try, and I bet it does.

HOWEVER.

I'd be a whole lot happier knowing exactly which field in "struct
device" that needed to be NULL before it gets registered.

I don't like how

device_register() + device_create_file(dev)..

is not sufficiently undone by

.. device_remove_file(dev) + device_unregister()

so that it can't be repeated. Exactly *what* state is stale and
re-used incorrectly if you do that device_register() a second time.

It smells like a misfeature of the device core handling.

But that does obviously explain why this started happening with a
fairly straightforward conversion from sysdev to struct device. It
just makes me worry about any *other* such conversions.

Of course, normal users will allocate and free the memory, so never
see this "re-use the same piece of memory" issue. But still..

Linus

Srivatsa S. Bhat

unread,

Jan 13, 2012, 10:05:59 PM1/13/12

to Linus Torvalds, Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, Greg Kroah-Hartman, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Jeff Chua, Suresh B Siddha, Peter Zijlstra, Mel Gorman, Gilad Ben-Yossef

On 01/14/2012 08:23 AM, Linus Torvalds wrote:

> On Fri, Jan 13, 2012 at 6:41 PM, Srivatsa S. Bhat
> <srivat...@linux.vnet.ibm.com> wrote:
>>
>> YES!! Finally I have a fix for this whole MCE thing! :-)
>
> Goodie.
>
>> The patch below works perfectly for me - I tested multiple CPU hotplug
>> operations as well as multiple pm_test runs at core level. Please let me
>> know if this solves the suspend issue as well..
>
> Ok, I'll try, and I bet it does.
>
> HOWEVER.
>
> I'd be a whole lot happier knowing exactly which field in "struct
> device" that needed to be NULL before it gets registered.
>
> I don't like how
>
> device_register() + device_create_file(dev)..
>
> is not sufficiently undone by
>
> .. device_remove_file(dev) + device_unregister()
>
> so that it can't be repeated. Exactly *what* state is stale and
> re-used incorrectly if you do that device_register() a second time.
>
> It smells like a misfeature of the device core handling.
>
> But that does obviously explain why this started happening with a
> fairly straightforward conversion from sysdev to struct device. It
> just makes me worry about any *other* such conversions.
>
> Of course, normal users will allocate and free the memory, so never
> see this "re-use the same piece of memory" issue. But still..
>

I totally agree with you. Even I had set out to find out *exactly* what
was going wrong. After spending significant amount of time digging through
the code (unsuccessfully), this idea of zeroing out everything struck me
and it worked, as expected. Yes, it is definitely important to know the
exact issue so that we can fix the driver core and avoid other mishaps,
but I guess finding that out is not all that simple.. as of now I am
rather exhausted following those zillions of pointers continuously
for the past few hours.. ;-/

Regards,
Srivatsa S. Bhat
IBM Linux Technology Center

--

Linus Torvalds

unread,

Jan 13, 2012, 10:10:27 PM1/13/12

to Srivatsa S. Bhat, Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, Greg Kroah-Hartman, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Jeff Chua, Suresh B Siddha, Peter Zijlstra, Mel Gorman, Gilad Ben-Yossef

On Fri, Jan 13, 2012 at 7:05 PM, Srivatsa S. Bhat

<srivat...@linux.vnet.ibm.com> wrote:
>> On Fri, Jan 13, 2012 at 6:41 PM, Srivatsa S. Bhat
>> <srivat...@linux.vnet.ibm.com> wrote:
>>
>>> The patch below works perfectly for me - I tested multiple CPU hotplug
>>> operations as well as multiple pm_test runs at core level. Please let me
>>> know if this solves the suspend issue as well..
>>
>> Ok, I'll try, and I bet it does.

Yup, I can suspend and resume multiple times now.

> I totally agree with you. Even I had set out to find out *exactly* what
> was going wrong. After spending significant amount of time digging through
> the code (unsuccessfully), this idea of zeroing out everything struck me
> and it worked, as expected. Yes, it is definitely important to know the
> exact issue so that we can fix the driver core and avoid other mishaps,
> but I guess finding that out is not all that simple.. as of now I am
> rather exhausted following those zillions of pointers continuously
> for the past few hours.. ;-/

Fair enough. The many layers of clever indirection are really really
nasty to follow, I agree.

Linus

Srivatsa S. Bhat

unread,

Jan 13, 2012, 10:18:07 PM1/13/12

to Linus Torvalds, Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, Greg Kroah-Hartman, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Jeff Chua, Suresh B Siddha, Peter Zijlstra, Mel Gorman, Gilad Ben-Yossef

On 01/14/2012 08:40 AM, Linus Torvalds wrote:

> On Fri, Jan 13, 2012 at 7:05 PM, Srivatsa S. Bhat
> <srivat...@linux.vnet.ibm.com> wrote:
>>> On Fri, Jan 13, 2012 at 6:41 PM, Srivatsa S. Bhat
>>> <srivat...@linux.vnet.ibm.com> wrote:
>>>
>>>> The patch below works perfectly for me - I tested multiple CPU hotplug
>>>> operations as well as multiple pm_test runs at core level. Please let me
>>>> know if this solves the suspend issue as well..
>>>
>>> Ok, I'll try, and I bet it does.
>
> Yup, I can suspend and resume multiple times now.

Wow! Feels good to hear that, at last! :-)

By the way, in my excitement, while sending the patch I forgot to give
credits to the problem reporters with a "Reported-by". Hope that gets
resolved.

>
>> I totally agree with you. Even I had set out to find out *exactly* what
>> was going wrong. After spending significant amount of time digging through
>> the code (unsuccessfully), this idea of zeroing out everything struck me
>> and it worked, as expected. Yes, it is definitely important to know the
>> exact issue so that we can fix the driver core and avoid other mishaps,
>> but I guess finding that out is not all that simple.. as of now I am
>> rather exhausted following those zillions of pointers continuously
>> for the past few hours.. ;-/
>
> Fair enough. The many layers of clever indirection are really really
> nasty to follow, I agree.
>

:-)

Regards,
Srivatsa S. Bhat
IBM Linux Technology Center

--

Linus Torvalds

unread,

Jan 13, 2012, 10:41:22 PM1/13/12

to Srivatsa S. Bhat, Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, Greg Kroah-Hartman, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Jeff Chua, Suresh B Siddha, Peter Zijlstra, Mel Gorman, Gilad Ben-Yossef

On Fri, Jan 13, 2012 at 7:18 PM, Srivatsa S. Bhat
<srivat...@linux.vnet.ibm.com> wrote:
>
> By the way, in my excitement, while sending the patch I forgot to give
> credits to the problem reporters with a "Reported-by". Hope that gets
> resolved.

.. and in my excitement to get it fixed, I committed and pushed it out
without thinking about it either.

So no, reporters didn't get any glory this time.

Sucks to be them,

Linus

Tony Luck

unread,

Jan 14, 2012, 12:15:29 AM1/14/12

to Linus Torvalds, Srivatsa S. Bhat, Ming Lei, Djalal Harouni, Borislav Petkov, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, Greg Kroah-Hartman, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Jeff Chua, Suresh B Siddha, Peter Zijlstra, Mel Gorman, Gilad Ben-Yossef

On Fri, Jan 13, 2012 at 7:10 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
> Fair enough. The many layers of clever indirection are really really
> nasty to follow, I agree.

You could cheat and bisect (not git bisect, numerical bisect). We know that:

memset(&dev->kobj, 0, sizeof(struct kobject));

is not enough. And that:

memset(dev, 0, sizeof(struct device));

works. The kobject is near the beginning of struct device. So try clearing
all the other bits from start of struct device up to "kobj" ... if that doesn't
help, then play with clearing various sized chunks after. Should be able to
zoom in on the culprit field ... then go look at the code.

-Tony

Greg KH

unread,

Jan 14, 2012, 9:49:38 AM1/14/12

to Linus Torvalds, Srivatsa S. Bhat, Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Jeff Chua, Suresh B Siddha, Peter Zijlstra, Mel Gorman, Gilad Ben-Yossef

On Fri, Jan 13, 2012 at 06:53:04PM -0800, Linus Torvalds wrote:
> On Fri, Jan 13, 2012 at 6:41 PM, Srivatsa S. Bhat
> <srivat...@linux.vnet.ibm.com> wrote:
> >
> > YES!! Finally I have a fix for this whole MCE thing! :-)
>
> Goodie.
>
> > The patch below works perfectly for me - I tested multiple CPU hotplug
> > operations as well as multiple pm_test runs at core level. Please let me
> > know if this solves the suspend issue as well..
>
> Ok, I'll try, and I bet it does.
>
> HOWEVER.
>
> I'd be a whole lot happier knowing exactly which field in "struct
> device" that needed to be NULL before it gets registered.
>
> I don't like how
>
> device_register() + device_create_file(dev)..
>
> is not sufficiently undone by
>
> .. device_remove_file(dev) + device_unregister()
>
> so that it can't be repeated. Exactly *what* state is stale and
> re-used incorrectly if you do that device_register() a second time.
>
> It smells like a misfeature of the device core handling.

It has to do with the fact that this is a "static" device that is being
reused. Normally it would be cleaned up properly in the release
function, but as there isn't one, some fields are being left in a bad
state.

I'll look into this Sunday better when I have the chance, I'm currently
on the road until late tonight, skiing, and it's hard to write patches
from a chair lift...

thanks,

greg k-h

Alan Stern

unread,

Jan 14, 2012, 11:30:03 AM1/14/12

to Greg KH, Linus Torvalds, Srivatsa S. Bhat, Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Jeff Chua, Suresh B Siddha, Peter Zijlstra, Mel Gorman, Gilad Ben-Yossef

On Sat, 14 Jan 2012, Greg KH wrote:

> On Fri, Jan 13, 2012 at 06:53:04PM -0800, Linus Torvalds wrote:
> > On Fri, Jan 13, 2012 at 6:41 PM, Srivatsa S. Bhat
> > <srivat...@linux.vnet.ibm.com> wrote:
> > >
> > > YES!! Finally I have a fix for this whole MCE thing! :-)
> >
> > Goodie.
> >
> > > The patch below works perfectly for me - I tested multiple CPU hotplug
> > > operations as well as multiple pm_test runs at core level. Please let me
> > > know if this solves the suspend issue as well..
> >
> > Ok, I'll try, and I bet it does.
> >
> > HOWEVER.
> >
> > I'd be a whole lot happier knowing exactly which field in "struct
> > device" that needed to be NULL before it gets registered.
> >
> > I don't like how
> >
> > device_register() + device_create_file(dev)..
> >
> > is not sufficiently undone by
> >
> > .. device_remove_file(dev) + device_unregister()
> >
> > so that it can't be repeated. Exactly *what* state is stale and
> > re-used incorrectly if you do that device_register() a second time.
> >
> > It smells like a misfeature of the device core handling.
>
> It has to do with the fact that this is a "static" device that is being
> reused. Normally it would be cleaned up properly in the release
> function, but as there isn't one, some fields are being left in a bad
> state.

That's exactly right. In general, device structures should never be
reused. Apart from the reinitialization issues, in the general case
you have the problem that the references to the previous incarnation
may not all have been dropped. Now, perhaps in the MCE case you _do_
know that they're all gone (I can't tell), but relying on it is
dangerous.

The driver core isn't designed to handle device structures that get
unregistered and then spring back to life; callers are supposed to
allocate a fresh new structure instead. (We had to solve this very
same problem in the USB subsystem a number of years ago; figuring it
all out was tricky even back then.) And this is true regardless of
whether the original structure was allocated dynamically or not.

Alan Stern

Jeff Chua

unread,

Jan 14, 2012, 3:45:36 PM1/14/12

to Alan Stern, Greg KH, Linus Torvalds, Srivatsa S. Bhat, Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Suresh B Siddha, Peter Zijlstra, Mel Gorman, Gilad Ben-Yossef

On Sun, Jan 15, 2012 at 12:30 AM, Alan Stern <st...@rowland.harvard.edu> wrote:
> On Sat, 14 Jan 2012, Greg KH wrote:
>
>> On Fri, Jan 13, 2012 at 06:53:04PM -0800, Linus Torvalds wrote:
>> > On Fri, Jan 13, 2012 at 6:41 PM, Srivatsa S. Bhat
>> > <srivat...@linux.vnet.ibm.com> wrote:
>> > >
>> > > YES!! Finally I have a fix for this whole MCE thing! :-)

Yeh! This fixed Suspend/Resume fixed too (re Linux-3.2.0 Pid: 1902,
comm: modprobe Not tainted 3.2.0 #64).

Now, I'm onto another issue with bluetooth. Bluetooth works well with
3.0.17 (may be 3.1.xx), but latest git is freezing the whole system
during pairing. Not good. Not sure whether this is related to the MCE.

Jeff.

Tony Luck

unread,

Jan 14, 2012, 9:05:24 PM1/14/12

to Alan Stern, Greg KH, Linus Torvalds, Srivatsa S. Bhat, Ming Lei, Djalal Harouni, Borislav Petkov, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Jeff Chua, Suresh B Siddha, Peter Zijlstra, Mel Gorman, Gilad Ben-Yossef

On Sat, Jan 14, 2012 at 8:30 AM, Alan Stern <st...@rowland.harvard.edu> wrote:
> The driver core isn't designed to handle device structures that get
> unregistered and then spring back to life; callers are supposed to
> allocate a fresh new structure instead.

Does every caller use "kzalloc()" for this fresh new structure (or even
for the original registration) rather than "kmalloc()"?

-Tony

Greg KH

unread,

Jan 14, 2012, 9:34:04 PM1/14/12

to Tony Luck, Alan Stern, Linus Torvalds, Srivatsa S. Bhat, Ming Lei, Djalal Harouni, Borislav Petkov, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Jeff Chua, Suresh B Siddha, Peter Zijlstra, Mel Gorman, Gilad Ben-Yossef

On Sat, Jan 14, 2012 at 06:05:24PM -0800, Tony Luck wrote:
> On Sat, Jan 14, 2012 at 8:30 AM, Alan Stern <st...@rowland.harvard.edu> wrote:
> > The driver core isn't designed to handle device structures that get
> > unregistered and then spring back to life; callers are supposed to
> > allocate a fresh new structure instead.
>
> Does every caller use "kzalloc()" for this fresh new structure (or even
> for the original registration) rather than "kmalloc()"?

They better, as that's what they are required to do :)

The kernel would be crashing with slab debugging enabled, if it wasn't
so today.

thanks,

greg k-h

Alan Stern

unread,

Jan 14, 2012, 10:36:53 PM1/14/12

to Greg KH, Tony Luck, Linus Torvalds, Srivatsa S. Bhat, Ming Lei, Djalal Harouni, Borislav Petkov, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Jeff Chua, Suresh B Siddha, Peter Zijlstra, Mel Gorman, Gilad Ben-Yossef

On Sat, 14 Jan 2012, Greg KH wrote:

> On Sat, Jan 14, 2012 at 06:05:24PM -0800, Tony Luck wrote:
> > On Sat, Jan 14, 2012 at 8:30 AM, Alan Stern <st...@rowland.harvard.edu> wrote:
> > > The driver core isn't designed to handle device structures that get
> > > unregistered and then spring back to life; callers are supposed to
> > > allocate a fresh new structure instead.
> >
> > Does every caller use "kzalloc()" for this fresh new structure (or even
> > for the original registration) rather than "kmalloc()"?
>
> They better, as that's what they are required to do :)
>
> The kernel would be crashing with slab debugging enabled, if it wasn't
> so today.

These are both important restrictions on the use of the driver core,
and they don't seem to be explained anywhere in the kernel source.
Would you like to receive a documentation patch?

Alan Stern

Greg KH

unread,

Jan 16, 2012, 1:11:35 PM1/16/12

to Linus Torvalds, Kay Sievers, Srivatsa S. Bhat, Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Jeff Chua, Suresh B Siddha, Peter Zijlstra, Mel Gorman, Gilad Ben-Yossef

On Sat, Jan 14, 2012 at 06:49:38AM -0800, Greg KH wrote:
> On Fri, Jan 13, 2012 at 06:53:04PM -0800, Linus Torvalds wrote:
> > On Fri, Jan 13, 2012 at 6:41 PM, Srivatsa S. Bhat
> > <srivat...@linux.vnet.ibm.com> wrote:
> > >
> > > YES!! Finally I have a fix for this whole MCE thing! :-)
> >
> > Goodie.
> >
> > > The patch below works perfectly for me - I tested multiple CPU hotplug
> > > operations as well as multiple pm_test runs at core level. Please let me
> > > know if this solves the suspend issue as well..
> >
> > Ok, I'll try, and I bet it does.
> >
> > HOWEVER.
> >
> > I'd be a whole lot happier knowing exactly which field in "struct
> > device" that needed to be NULL before it gets registered.
> >
> > I don't like how
> >
> > device_register() + device_create_file(dev)..
> >
> > is not sufficiently undone by
> >
> > .. device_remove_file(dev) + device_unregister()
> >
> > so that it can't be repeated. Exactly *what* state is stale and
> > re-used incorrectly if you do that device_register() a second time.
> >
> > It smells like a misfeature of the device core handling.
>
> It has to do with the fact that this is a "static" device that is being
> reused. Normally it would be cleaned up properly in the release
> function, but as there isn't one, some fields are being left in a bad
> state.

Kay, I looked at this this morning, and it comes down to the line:

DEFINE_PER_CPU(struct device, mce_device);

Where we are creating static struct device variables. I'm guessing this
is just done for "convenience" as we really don't care about where in
memory these structures are, we just want to make sure we have enough of
them around (this is the way all the other mce per-cpu structures are
handled.)

I couldn't figure out a "simple" way to create a variable per cpu here,
dynamically. I tried doing something like:
struct device *mce_device[CONFIG_NR_CPUS];
and dynamically create and clean them up when they go away, setting the
array value to NULL when they are unregistered, and let them clean up in
the release function, but does that race with creating the device again?

It seems that this would work, but I'm probably missing something
obvious here, any ideas?

The "correct" way to fix this up would be to have a per-cpu structure
for all of the different mce things that are created in this driver
(struct device, struct mce, exception counts, work queues, polling
banks, etc.), but that seems pretty messy, and I imagine some of these
want to stay as-is for some performance issues. As I don't know this
code at all, I'm a bit leary to make that kind of change.

Greg KH

unread,

Jan 16, 2012, 1:15:36 PM1/16/12

to Alan Stern, Tony Luck, Linus Torvalds, Srivatsa S. Bhat, Ming Lei, Djalal Harouni, Borislav Petkov, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Jeff Chua, Suresh B Siddha, Peter Zijlstra, Mel Gorman, Gilad Ben-Yossef

On Sat, Jan 14, 2012 at 10:36:53PM -0500, Alan Stern wrote:
> On Sat, 14 Jan 2012, Greg KH wrote:
>
> > On Sat, Jan 14, 2012 at 06:05:24PM -0800, Tony Luck wrote:
> > > On Sat, Jan 14, 2012 at 8:30 AM, Alan Stern <st...@rowland.harvard.edu> wrote:
> > > > The driver core isn't designed to handle device structures that get
> > > > unregistered and then spring back to life; callers are supposed to
> > > > allocate a fresh new structure instead.
> > >
> > > Does every caller use "kzalloc()" for this fresh new structure (or even
> > > for the original registration) rather than "kmalloc()"?
> >
> > They better, as that's what they are required to do :)
> >
> > The kernel would be crashing with slab debugging enabled, if it wasn't
> > so today.
>
> These are both important restrictions on the use of the driver core,
> and they don't seem to be explained anywhere in the kernel source.
> Would you like to receive a documentation patch?

For some reason I thought it was documented somewhere, but I can't seem
to find that right now. So yes, a patch documenting it would be great
to have.

thanks,

greg k-h

Luck, Tony

unread,

Jan 16, 2012, 1:27:16 PM1/16/12

to Greg KH, Linus Torvalds, Kay Sievers, Srivatsa S. Bhat, Ming Lei, Djalal Harouni, Borislav Petkov, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Jeff Chua, Siddha, Suresh B, Peter Zijlstra, Mel Gorman, Gilad Ben-Yossef

> The "correct" way to fix this up would be to have a per-cpu structure
> for all of the different mce things that are created in this driver
> (struct device, struct mce, exception counts, work queues, polling
> banks, etc.), but that seems pretty messy, and I imagine some of these
> want to stay as-is for some performance issues. As I don't know this
> code at all, I'm a bit leary to make that kind of change.

If you get so many machine checks that you care about the performance
of the handler - you may be worrying about the wrong things.

I'm more concerned about maintainability of the code. Seto-san has
submitted many patches re-grouping the functions inside mce.c into
functional areas - keeping the data structures separated makes
sense - especially if there is some goal of splitting mce.c into
separate files.

-Tony

Greg KH

unread,

Jan 16, 2012, 1:34:38 PM1/16/12

to Luck, Tony, Linus Torvalds, Kay Sievers, Srivatsa S. Bhat, Ming Lei, Djalal Harouni, Borislav Petkov, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Jeff Chua, Siddha, Suresh B, Peter Zijlstra, Mel Gorman, Gilad Ben-Yossef

On Mon, Jan 16, 2012 at 06:27:16PM +0000, Luck, Tony wrote:
> > The "correct" way to fix this up would be to have a per-cpu structure
> > for all of the different mce things that are created in this driver
> > (struct device, struct mce, exception counts, work queues, polling
> > banks, etc.), but that seems pretty messy, and I imagine some of these
> > want to stay as-is for some performance issues. As I don't know this
> > code at all, I'm a bit leary to make that kind of change.
>
> If you get so many machine checks that you care about the performance
> of the handler - you may be worrying about the wrong things.
>
> I'm more concerned about maintainability of the code. Seto-san has
> submitted many patches re-grouping the functions inside mce.c into
> functional areas - keeping the data structures separated makes
> sense - especially if there is some goal of splitting mce.c into
> separate files.

Ok, I'll leave that alone, and just focus on the struct device stuff, as
that I know can't be performance critical :)

thanks,

greg k-h

Kay Sievers

unread,

Jan 16, 2012, 1:42:14 PM1/16/12

to Greg KH, Linus Torvalds, Srivatsa S. Bhat, Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Jeff Chua, Suresh B Siddha, Peter Zijlstra, Mel Gorman, Gilad Ben-Yossef

On Mon, Jan 16, 2012 at 19:11, Greg KH <gre...@suse.de> wrote:

>> It has to do with the fact that this is a "static" device that is being
>> reused. Normally it would be cleaned up properly in the release
>> function, but as there isn't one, some fields are being left in a bad
>> state.
>
> Kay, I looked at this this morning, and it comes down to the line:
>
> DEFINE_PER_CPU(struct device, mce_device);
>
> Where we are creating static struct device variables. I'm guessing this
> is just done for "convenience" as we really don't care about where in
> memory these structures are, we just want to make sure we have enough of
> them around (this is the way all the other mce per-cpu structures are
> handled.)

It's a pretty common pattern across the CPU handling code. I don't
know the details, but I doubt many of them are needed or really useful
in the context which they are used. So, it's either convenience or
just 'hoping stuff could be made without managing memory'. :)

Having the 'machinecheck' devices pre-allocated per CPU might not make
too much sense, because the driver core operations will still need to
allocate stuff dynamically at register() time anyway. Attaching
devices to interfaces and drivers of a subsystem will require the same
thing, even the simplest kobject allocates the name dynamically.

I guess, the driver core part ofmachinecheckcould just be fully
dynamic, instead of statically pre-allocated.

Kay

Suresh Siddha

unread,

Jan 16, 2012, 9:21:32 PM1/16/12

to Srivatsa S. Bhat, Linus Torvalds, Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, Greg Kroah-Hartman, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Jeff Chua, Peter Zijlstra, Mel Gorman, Gilad Ben-Yossef

On Sat, 2012-01-14 at 08:11 +0530, Srivatsa S. Bhat wrote:
> Of course, the warnings at drivers/base/core.c: device_release()
> as well as the IPI to offline cpu warnings still appear but are rather
> unrelated and harmless to the issue being discussed.

As far the IPI offline cpu warnings are concerned, appended patch should
fix it. Can you please give it a try? Peterz, can you please review and
queue it after Srivatsa confirms that it works? Thanks.
---

From: Suresh Siddha <suresh....@intel.com>
Subject: sched, nohz: fix nohz cpu idle load balancing state with cpu hotplug

With the recent nohz scheduler changes, rq's nohz flag 'NOHZ_TICK_STOPPED'
and its associated state doesn't get cleared immediately after the
cpu exits idle. This gets cleared as part of the next tick seen on that cpu.

With the cpu offline, we need to clear this state manually. Fix it by
registering a cpu notifier which clears the nohz idle load balance
state for this rq explicitly.

Reported-by: Srivatsa S. Bhat <srivat...@linux.vnet.ibm.com>
Signed-off-by: Suresh Siddha <suresh....@intel.com>
---
kernel/sched/fair.c | 40 +++++++++++++++++++++++++++++++++++-----
1 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8e42de9..defc92d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4862,6 +4862,15 @@ static void nohz_balancer_kick(int cpu)
return;
}

+static inline void clear_nohz_tick_stopped(int cpu)
+{
+ if (unlikely(test_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu)))) {
+ cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
+ atomic_dec(&nohz.nr_cpus);
+ clear_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu));
+ }
+}
+
static inline void set_cpu_sd_state_busy(void)
{
struct sched_domain *sd;
@@ -4900,6 +4909,12 @@ void select_nohz_load_balancer(int stop_tick)
{
int cpu = smp_processor_id();

+ /*
+ * If this cpu is going down, then nothing needs to be done.
+ */
+ if (!cpu_active(cpu))
+ return;
+
if (stop_tick) {
if (test_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu)))
return;
@@ -4910,6 +4925,24 @@ void select_nohz_load_balancer(int stop_tick)
}
return;
}
+
+static void __cpuinit __sched_ilb_notifier(void *info)
+{
+ clear_nohz_tick_stopped(smp_processor_id());
+}
+
+static int __cpuinit sched_ilb_notifier(struct notifier_block *nfb,
+ unsigned long action, void *hcpu)
+{
+ switch (action & ~CPU_TASKS_FROZEN) {
+ case CPU_DOWN_PREPARE:
+ smp_call_function_single((unsigned long) hcpu,
+ __sched_ilb_notifier, NULL, 1);
+ return NOTIFY_OK;
+ default:
+ return NOTIFY_DONE;
+ }
+}
#endif

static DEFINE_SPINLOCK(balancing);
@@ -5066,11 +5099,7 @@ static inline int nohz_kick_needed(struct rq *rq, int cpu)
* busy tick after returning from idle, we will update the busy stats.
*/
set_cpu_sd_state_busy();
- if (unlikely(test_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu)))) {
- clear_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu));
- cpumask_clear_cpu(cpu, nohz.idle_cpus_mask);
- atomic_dec(&nohz.nr_cpus);
- }
+ clear_nohz_tick_stopped(cpu);

/*
* None are in tickless mode and hence no need for NOHZ idle load
@@ -5586,6 +5615,7 @@ __init void init_sched_fair_class(void)

#ifdef CONFIG_NO_HZ
zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
+ cpu_notifier(sched_ilb_notifier, 0);
#endif
#endif /* SMP */

Srivatsa S. Bhat

unread,

Jan 17, 2012, 4:52:42 AM1/17/12

to Suresh Siddha, Linus Torvalds, Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, Greg Kroah-Hartman, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Jeff Chua, Peter Zijlstra, Mel Gorman, Gilad Ben-Yossef

On 01/17/2012 07:51 AM, Suresh Siddha wrote:

> On Sat, 2012-01-14 at 08:11 +0530, Srivatsa S. Bhat wrote:
>> Of course, the warnings at drivers/base/core.c: device_release()
>> as well as the IPI to offline cpu warnings still appear but are rather
>> unrelated and harmless to the issue being discussed.
>
> As far the IPI offline cpu warnings are concerned, appended patch should
> fix it. Can you please give it a try? Peterz, can you please review and
> queue it after Srivatsa confirms that it works? Thanks.

Hi Suresh,

Thanks for the patch, but unfortunately it doesn't fix the problem!
Exactly the same stack traces are seen during a CPU Hotplug stress test.
(I didn't even have to stress it - it is so fragile that just a script
to offline all cpus except the boot cpu was good enough to reproduce the
problem easily.)

[ 562.269083] ------------[ cut here ]------------
[ 562.273079] WARNING: at arch/x86/kernel/smp.c:120 native_smp_send_reschedule+0x59/0x60()
[ 562.273079] Hardware name: IBM System x -[7870C4Q]-
[ 562.273079] Modules linked in: ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf microcode fuse loop dm_mod iTCO_wdt i7core_edac i2c_i801 ioatdma cdc_ether i2c_core tpm_tis bnx2 shpchp usbnet pcspkr mii iTCO_vendor_support edac_core serio_raw dca sg rtc_cmos tpm tpm_bios pci_hotplug button uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif edd ext3 mbcache jbd fan processor mptsas mptscsih mptbase scsi_transport_sas scsi_mod thermal thermal_sys hwmon
[ 562.273079] Pid: 6, comm: migration/0 Not tainted 3.2.0-sureshipi-0.0.0.28.36b5ec9-default #2
[ 562.273079] Call Trace:
[ 562.273079] <IRQ> [<ffffffff810213d9>] ? native_smp_send_reschedule+0x59/0x60
[ 562.273079] [<ffffffff8103cf4a>] warn_slowpath_common+0x7a/0xb0
[ 562.273079] [<ffffffff8103cf95>] warn_slowpath_null+0x15/0x20
[ 562.273079] [<ffffffff810213d9>] native_smp_send_reschedule+0x59/0x60
[ 562.273079] [<ffffffff81082d65>] trigger_load_balance+0x185/0x500
[ 562.273079] [<ffffffff81082d9b>] ? trigger_load_balance+0x1bb/0x500
[ 562.273079] [<ffffffff81073db7>] scheduler_tick+0x107/0x170
[ 562.273079] [<ffffffff8104e6f7>] update_process_times+0x67/0x80
[ 562.273079] [<ffffffff8109c64f>] tick_sched_timer+0x5f/0xc0
[ 562.273079] [<ffffffff8109c5f0>] ? tick_nohz_handler+0x100/0x100
[ 562.273079] [<ffffffff8106a85e>] __run_hrtimer+0x12e/0x330
[ 562.273079] [<ffffffff8106aca7>] hrtimer_interrupt+0xc7/0x1f0
[ 562.273079] [<ffffffff81022ff4>] smp_apic_timer_interrupt+0x64/0xa0
[ 562.273079] [<ffffffff814a2a33>] apic_timer_interrupt+0x73/0x80
[ 562.273079] <EOI> [<ffffffff810c563a>] ? stop_machine_cpu_stop+0xda/0x130
[ 562.273079] [<ffffffff810c5560>] ? stop_one_cpu_nowait+0x50/0x50
[ 562.273079] [<ffffffff810c5279>] cpu_stopper_thread+0xd9/0x1b0
[ 562.273079] [<ffffffff81498ddf>] ? _raw_spin_unlock_irqrestore+0x3f/0x80
[ 562.273079] [<ffffffff810c51a0>] ? res_counter_init+0x50/0x50
[ 562.273079] [<ffffffff810a2add>] ? trace_hardirqs_on_caller+0x12d/0x1b0
[ 562.273079] [<ffffffff810a2b6d>] ? trace_hardirqs_on+0xd/0x10
[ 562.273079] [<ffffffff810c51a0>] ? res_counter_init+0x50/0x50
[ 562.273079] [<ffffffff8106553e>] kthread+0x9e/0xb0
[ 562.273079] [<ffffffff814a3334>] kernel_thread_helper+0x4/0x10
[ 562.273079] [<ffffffff81499174>] ? retint_restore_args+0x13/0x13
[ 562.273079] [<ffffffff810654a0>] ? __init_kthread_worker+0x70/0x70
[ 562.273079] [<ffffffff814a3330>] ? gs_change+0x13/0x13
[ 562.273079] ---[ end trace 4efec5b2532b902d ]---

I have a few questions regarding the synchronization with CPU Hotplug.
What guarantees that the code which selects and IPIs the new ilb is totally
race-free with respect to CPU hotplug and we will never IPI an offline CPU?

(In 3.2-rc7 I hadn't hit the IPI to offline cpu issue (the above stack trace)
as far as I remember.)

While trying to figure out what changed in the 3.3 merge window, I added a
WARN_ON in the 3.2-rc7 kernel as shown below:

static void nohz_balancer_kick(int cpu)
{
....

if (!cpu_rq(ilb_cpu)->nohz_balance_kick) {
cpu_rq(ilb_cpu)->nohz_balance_kick = 1;

smp_mb();
/*
* Use smp_send_reschedule() instead of resched_cpu().
* This way we generate a sched IPI on the target cpu which
* is idle. And the softirq performing nohz idle load balance
* will be run before returning from the IPI.
*/
==========> if (!cpu_active(ilb_cpu))
==========> WARN_ON(1);
smp_send_reschedule(ilb_cpu);
}
return;
}

As expected, I hit this warning during my CPU hotplug stress tests. I am sure
this happens on latest kernel too (3.3 merge window), since there is
apparently no change in that part of code in that aspect.

So, while selecting the new ilb, why are we not careful enough to ensure we
don't select a cpu that is going offline? Is this by design (to avoid some
overhead) or is this a bug? (As demonstrated above, this issue is in 3.2-rc7
as well.)

And the only reason I can think why we did not hit the "IPI to offline CPU"
issue in 3.2-rc7 kernel is that the race window (with CPU offline) was
probably too small and _not_ because we explicitly synchronized with CPU
Hotplug.

Probably I am missing something obvious... I would be grateful if you could
kindly help me understand how this works..

Regards,
Srivatsa S. Bhat

Jeff Chua

unread,

Jan 17, 2012, 11:15:38 AM1/17/12

to Srivatsa S. Bhat, Suresh Siddha, Linus Torvalds, Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, Greg Kroah-Hartman, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Peter Zijlstra, Mel Gorman, Gilad Ben-Yossef

On Tue, Jan 17, 2012 at 5:52 PM, Srivatsa S. Bhat
<srivat...@linux.vnet.ibm.com> wrote:
> On 01/17/2012 07:51 AM, Suresh Siddha wrote:
>
>> On Sat, 2012-01-14 at 08:11 +0530, Srivatsa S. Bhat wrote:
>>> Of course, the warnings at drivers/base/core.c: device_release()
>>> as well as the IPI to offline cpu warnings still appear but are rather
>>> unrelated and harmless to the issue being discussed.
>>
>> As far the IPI offline cpu warnings are concerned, appended patch should
>> fix it. Can you please give it a try? Peterz, can you please review and
>> queue it after Srivatsa confirms that it works? Thanks.
>
>
> Hi Suresh,
>
> Thanks for the patch, but unfortunately it doesn't fix the problem!
> Exactly the same stack traces are seen during a CPU Hotplug stress test.
> (I didn't even have to stress it - it is so fragile that just a script
> to offline all cpus except the boot cpu was good enough to reproduce the
> problem easily.)

Works for me. But I'm still seeing this only during boot. Related?
Shall I bisect?

Freeing unused kernel memory: 520k freed
Write protecting the kernel read-only data: 8192k
Freeing unused kernel memory: 1140k freed
Freeing unused kernel memory: 464k freed
Adding 8290300k swap on /dev/sda3. Priority:-1 extents:1 across:8290300k SS
vmalloc: allocation failure: 0 bytes
modprobe: page allocation failure: order:0, mode:0xd2
Pid: 1914, comm: modprobe Not tainted 3.2.0 #6
Call Trace:
[<ffffffff8107c1ff>] ? 0xffffffff8107c1ff
[<ffffffff81061fec>] ? 0xffffffff81061fec
[<ffffffff8109ab6c>] ? 0xffffffff8109ab6c
[<ffffffff81061fec>] ? 0xffffffff81061fec
[<ffffffff8101bacc>] ? 0xffffffff8101bacc
[<ffffffff81061fec>] ? 0xffffffff81061fec
[<ffffffff81061fec>] ? 0xffffffff81061fec
[<ffffffff81062ec8>] ? 0xffffffff81062ec8
[<ffffffff810637c1>] ? 0xffffffff810637c1
[<ffffffff814d9cb9>] ? 0xffffffff814d9cb9
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
CPU 2: hi: 0, btch: 1 usd: 0
CPU 3: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 158
CPU 1: hi: 186, btch: 31 usd: 25
CPU 2: hi: 186, btch: 31 usd: 0
CPU 3: hi: 186, btch: 31 usd: 0
Node 0 Normal per-cpu:
CPU 0: hi: 186, btch: 31 usd: 93
CPU 1: hi: 186, btch: 31 usd: 74
CPU 2: hi: 186, btch: 31 usd: 170
CPU 3: hi: 186, btch: 31 usd: 60
active_anon:6162 inactive_anon:1 isolated_anon:0
active_file:1782 inactive_file:5164 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
free:1963131 slab_reclaimable:818 slab_unreclaimable:2728
mapped:1639 shmem:3 pagetables:292 bounce:0

Thanks,
Jeff

Srivatsa S. Bhat

unread,

Jan 17, 2012, 11:36:55 AM1/17/12

to Jeff Chua, Suresh Siddha, Linus Torvalds, Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, Greg Kroah-Hartman, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Peter Zijlstra, Mel Gorman, Gilad Ben-Yossef

This is a different problem. Not the same as the one Suresh's patch
intended to fix. Your case has something to do with memory allocation
failures. The problem I am facing is Inter-Processor Interrupts (IPIs)
being sent to CPUs that are going offline, after selecting them as the
new ilb (Idle load balancer).

Regards,
Srivatsa S. Bhat
IBM Linux Technology Center

--

Suresh Siddha

unread,

Jan 17, 2012, 10:17:04 PM1/17/12

to Srivatsa S. Bhat, Linus Torvalds, Ming Lei, Djalal Harouni, Borislav Petkov, Tony Luck, Hidetoshi Seto, Ingo Molnar, Andi Kleen, linux-...@vger.kernel.org, Greg Kroah-Hartman, Kay Sievers, gou...@et.bocholt.fh-gelsenkirchen.de, Marcos Souza, Linux PM mailing list, Rafael J. Wysocki, tg...@linutronix.de, pra...@linux.vnet.ibm.com, justin...@gmail.com, Jeff Chua, Peter Zijlstra, Mel Gorman, Gilad Ben-Yossef

On Tue, 2012-01-17 at 15:22 +0530, Srivatsa S. Bhat wrote:
> Thanks for the patch, but unfortunately it doesn't fix the problem!
> Exactly the same stack traces are seen during a CPU Hotplug stress test.
> (I didn't even have to stress it - it is so fragile that just a script
> to offline all cpus except the boot cpu was good enough to reproduce the
> problem easily.)

hmm, that's weird. with the patch, sched_ilb_notifier() should have
cleared the cpu going offline from the nohz.idle_cpus_mask. And this
should have happened after that cpu is removed from active mask. So
no-one else should add that cpu back to the nohz.idle_cpus_mask and this
should prevent the issue from happening.

I could reproduce the problem easily with out the patch but when I
applied the patch I couldn't recreate the issue. Srivatsa, can you
please re-check the kernel you tested indeed has the fix?

re-Reviewing the code/patch also doesn't give me a hint.

> I have a few questions regarding the synchronization with CPU Hotplug.
> What guarantees that the code which selects and IPIs the new ilb is totally
> race-free with respect to CPU hotplug and we will never IPI an offline CPU?

So, nohz_balancer_kick() gets called only from interrupts disabled.
During that time (from selecting the ilb_cpu to sending the IPI), no cpu
can go offline. As the offline happens from the stop-machine process
context with interrupts disabled.

Only thing we need to make sure is the offlined cpu shouldn't be part of
the nohz.idle_cpus_mask and for post 3.2 code, posted patch ensures
that.

For 3.2 and before, when a cpu exits tickless idle, it gets removed from
the nohz.idle_cpus_mask (and also from the nohz.load_balancer). And if
the cpu is not in the active mask (while going offline), subsequent
calls to select_nohz_load_balancer() ensures that the cpu going down
doesn't update the nohz structures. So I thought 3.2 shouldn't exhibit
this problem.

> (As demonstrated above, this issue is in 3.2-rc7
> as well.)

hmm, don't think we ran into this before 3.2. So, what am I missing from
the above? I will try to reproduce it on 3.2 too.

thanks,
suresh

Reply all

Reply to author

Forward