mm: deadlock between get_online_cpus/pcpu

Dmitry Vyukov

unread,

Jan 29, 2017, 7:45:07 AM1/29/17

to Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller

Hello,

I've got the following deadlock report while running syzkaller fuzzer
on f37208bc3c9c2f811460ef264909dfbc7f605a60:

[ INFO: possible circular locking dependency detected ]
4.10.0-rc5-next-20170125 #1 Not tainted
-------------------------------------------------------
syz-executor3/14255 is trying to acquire lock:
(cpu_hotplug.dep_map){++++++}, at: [<ffffffff814271c7>]
get_online_cpus+0x37/0x90 kernel/cpu.c:239

but task is already holding lock:
(pcpu_alloc_mutex){+.+.+.}, at: [<ffffffff81937fee>]
pcpu_alloc+0xbfe/0x1290 mm/percpu.c:897

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (pcpu_alloc_mutex){+.+.+.}:

[<ffffffff8157a169>] validate_chain kernel/locking/lockdep.c:2265 [inline]
[<ffffffff8157a169>] __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3338
[<ffffffff8157c2f1>] lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753
[<ffffffff8447fb62>] __mutex_lock_common kernel/locking/mutex.c:757 [inline]
[<ffffffff8447fb62>] __mutex_lock+0x382/0x25c0 kernel/locking/mutex.c:894
[<ffffffff84481db6>] mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:909
[<ffffffff81937fee>] pcpu_alloc+0xbfe/0x1290 mm/percpu.c:897
[<ffffffff819386d4>] __alloc_percpu+0x24/0x30 mm/percpu.c:1076
[<ffffffff81684963>] smpcfd_prepare_cpu+0x73/0xd0 kernel/smp.c:47
[<ffffffff81428296>] cpuhp_invoke_callback+0x256/0x1480 kernel/cpu.c:136
[<ffffffff81429a11>] cpuhp_up_callbacks+0x81/0x2a0 kernel/cpu.c:425
[<ffffffff8142bd53>] _cpu_up+0x1e3/0x2a0 kernel/cpu.c:940
[<ffffffff8142be83>] do_cpu_up+0x73/0xa0 kernel/cpu.c:970
[<ffffffff8142bec8>] cpu_up+0x18/0x20 kernel/cpu.c:978
[<ffffffff8570eec9>] smp_init+0x148/0x160 kernel/smp.c:565
[<ffffffff856a2fef>] kernel_init_freeable+0x43e/0x695 init/main.c:1026
[<ffffffff8446ccf3>] kernel_init+0x13/0x180 init/main.c:955
[<ffffffff8448f0b1>] ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430

-> #1 (cpu_hotplug.lock){+.+.+.}:

[<ffffffff8157a169>] validate_chain kernel/locking/lockdep.c:2265 [inline]
[<ffffffff8157a169>] __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3338
[<ffffffff8157c2f1>] lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753
[<ffffffff8447fb62>] __mutex_lock_common kernel/locking/mutex.c:757 [inline]
[<ffffffff8447fb62>] __mutex_lock+0x382/0x25c0 kernel/locking/mutex.c:894
[<ffffffff84481db6>] mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:909
[<ffffffff8142b9d6>] cpu_hotplug_begin+0x206/0x2e0 kernel/cpu.c:297
[<ffffffff8142bc3a>] _cpu_up+0xca/0x2a0 kernel/cpu.c:894
[<ffffffff8142be83>] do_cpu_up+0x73/0xa0 kernel/cpu.c:970
[<ffffffff8142bec8>] cpu_up+0x18/0x20 kernel/cpu.c:978
[<ffffffff8570eec9>] smp_init+0x148/0x160 kernel/smp.c:565
[<ffffffff856a2fef>] kernel_init_freeable+0x43e/0x695 init/main.c:1026
[<ffffffff8446ccf3>] kernel_init+0x13/0x180 init/main.c:955
[<ffffffff8448f0b1>] ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430

-> #0 (cpu_hotplug.dep_map){++++++}:

[<ffffffff81573ebf>] check_prev_add kernel/locking/lockdep.c:1828 [inline]
[<ffffffff81573ebf>] check_prevs_add+0xa8f/0x19f0 kernel/locking/lockdep.c:1938
[<ffffffff8157a169>] validate_chain kernel/locking/lockdep.c:2265 [inline]
[<ffffffff8157a169>] __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3338
[<ffffffff8157c2f1>] lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753
[<ffffffff814271f2>] get_online_cpus+0x62/0x90 kernel/cpu.c:241
[<ffffffff818991ec>] drain_all_pages.part.98+0x8c/0x8f0 mm/page_alloc.c:2371
[<ffffffff818adac6>] drain_all_pages mm/page_alloc.c:2364 [inline]
[<ffffffff818adac6>] __alloc_pages_direct_reclaim mm/page_alloc.c:3435 [inline]
[<ffffffff818adac6>] __alloc_pages_slowpath+0x966/0x23d0 mm/page_alloc.c:3773
[<ffffffff818afe25>] __alloc_pages_nodemask+0x8f5/0xc60 mm/page_alloc.c:3975
[<ffffffff819348a1>] __alloc_pages include/linux/gfp.h:426 [inline]
[<ffffffff819348a1>] __alloc_pages_node include/linux/gfp.h:439 [inline]
[<ffffffff819348a1>] alloc_pages_node include/linux/gfp.h:453 [inline]
[<ffffffff819348a1>] pcpu_alloc_pages mm/percpu-vm.c:93 [inline]
[<ffffffff819348a1>] pcpu_populate_chunk+0x1e1/0x900 mm/percpu-vm.c:282
[<ffffffff81938205>] pcpu_alloc+0xe15/0x1290 mm/percpu.c:999
[<ffffffff819386a7>] __alloc_percpu_gfp+0x27/0x30 mm/percpu.c:1063
[<ffffffff81811913>] bpf_array_alloc_percpu kernel/bpf/arraymap.c:33 [inline]
[<ffffffff81811913>] array_map_alloc+0x543/0x700 kernel/bpf/arraymap.c:94
[<ffffffff817f53cd>] find_and_alloc_map kernel/bpf/syscall.c:37 [inline]
[<ffffffff817f53cd>] map_create kernel/bpf/syscall.c:228 [inline]
[<ffffffff817f53cd>] SYSC_bpf kernel/bpf/syscall.c:1040 [inline]
[<ffffffff817f53cd>] SyS_bpf+0x108d/0x27c0 kernel/bpf/syscall.c:997
[<ffffffff8448ee41>] entry_SYSCALL_64_fastpath+0x1f/0xc2

other info that might help us debug this:

Chain exists of:
cpu_hotplug.dep_map --> cpu_hotplug.lock --> pcpu_alloc_mutex

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(pcpu_alloc_mutex);
lock(cpu_hotplug.lock);
lock(pcpu_alloc_mutex);
lock(cpu_hotplug.dep_map);

*** DEADLOCK ***

1 lock held by syz-executor3/14255:
#0: (pcpu_alloc_mutex){+.+.+.}, at: [<ffffffff81937fee>]
pcpu_alloc+0xbfe/0x1290 mm/percpu.c:897

stack backtrace:
CPU: 1 PID: 14255 Comm: syz-executor3 Not tainted 4.10.0-rc5-next-20170125 #1
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:15 [inline]
dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
print_circular_bug+0x307/0x3b0 kernel/locking/lockdep.c:1202
check_prev_add kernel/locking/lockdep.c:1828 [inline]
check_prevs_add+0xa8f/0x19f0 kernel/locking/lockdep.c:1938
validate_chain kernel/locking/lockdep.c:2265 [inline]
__lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3338
lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753
get_online_cpus+0x62/0x90 kernel/cpu.c:241
drain_all_pages.part.98+0x8c/0x8f0 mm/page_alloc.c:2371
drain_all_pages mm/page_alloc.c:2364 [inline]
__alloc_pages_direct_reclaim mm/page_alloc.c:3435 [inline]
__alloc_pages_slowpath+0x966/0x23d0 mm/page_alloc.c:3773
__alloc_pages_nodemask+0x8f5/0xc60 mm/page_alloc.c:3975
__alloc_pages include/linux/gfp.h:426 [inline]
__alloc_pages_node include/linux/gfp.h:439 [inline]
alloc_pages_node include/linux/gfp.h:453 [inline]
pcpu_alloc_pages mm/percpu-vm.c:93 [inline]
pcpu_populate_chunk+0x1e1/0x900 mm/percpu-vm.c:282
pcpu_alloc+0xe15/0x1290 mm/percpu.c:999
__alloc_percpu_gfp+0x27/0x30 mm/percpu.c:1063
bpf_array_alloc_percpu kernel/bpf/arraymap.c:33 [inline]
array_map_alloc+0x543/0x700 kernel/bpf/arraymap.c:94
find_and_alloc_map kernel/bpf/syscall.c:37 [inline]
map_create kernel/bpf/syscall.c:228 [inline]
SYSC_bpf kernel/bpf/syscall.c:1040 [inline]
SyS_bpf+0x108d/0x27c0 kernel/bpf/syscall.c:997
entry_SYSCALL_64_fastpath+0x1f/0xc2

Vlastimil Babka

unread,

Jan 29, 2017, 12:22:52 PM1/29/17

to Dmitry Vyukov, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Mel Gorman, Michal Hocko, Andrew Morton

On 29.1.2017 13:44, Dmitry Vyukov wrote:
> Hello,
>
> I've got the following deadlock report while running syzkaller fuzzer
> on f37208bc3c9c2f811460ef264909dfbc7f605a60:
>
> [ INFO: possible circular locking dependency detected ]
> 4.10.0-rc5-next-20170125 #1 Not tainted
> -------------------------------------------------------
> syz-executor3/14255 is trying to acquire lock:
> (cpu_hotplug.dep_map){++++++}, at: [<ffffffff814271c7>]
> get_online_cpus+0x37/0x90 kernel/cpu.c:239
>
> but task is already holding lock:
> (pcpu_alloc_mutex){+.+.+.}, at: [<ffffffff81937fee>]
> pcpu_alloc+0xbfe/0x1290 mm/percpu.c:897
>
> which lock already depends on the new lock.

I suspect the dependency comes from recent changes in drain_all_pages(). They
were later redone (for other reasons, but nice to have another validation) in
the mmots patch [1], which AFAICS is not yet in mmotm and thus linux-next. Could
you try if it helps?

Vlastimil

[1]
http://ozlabs.org/~akpm/mmots/broken-out/mm-page_alloc-use-static-global-work_struct-for-draining-per-cpu-pages.patch

> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majo...@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"do...@kvack.org"> em...@kvack.org </a>
>

Dmitry Vyukov

unread,

Jan 30, 2017, 10:48:29 AM1/30/17

to Vlastimil Babka, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Mel Gorman, Michal Hocko, Andrew Morton

On Sun, Jan 29, 2017 at 6:22 PM, Vlastimil Babka <vba...@suse.cz> wrote:
> On 29.1.2017 13:44, Dmitry Vyukov wrote:
>> Hello,
>>
>> I've got the following deadlock report while running syzkaller fuzzer
>> on f37208bc3c9c2f811460ef264909dfbc7f605a60:
>>
>> [ INFO: possible circular locking dependency detected ]
>> 4.10.0-rc5-next-20170125 #1 Not tainted
>> -------------------------------------------------------
>> syz-executor3/14255 is trying to acquire lock:
>> (cpu_hotplug.dep_map){++++++}, at: [<ffffffff814271c7>]
>> get_online_cpus+0x37/0x90 kernel/cpu.c:239
>>
>> but task is already holding lock:
>> (pcpu_alloc_mutex){+.+.+.}, at: [<ffffffff81937fee>]
>> pcpu_alloc+0xbfe/0x1290 mm/percpu.c:897
>>
>> which lock already depends on the new lock.
>
> I suspect the dependency comes from recent changes in drain_all_pages(). They
> were later redone (for other reasons, but nice to have another validation) in
> the mmots patch [1], which AFAICS is not yet in mmotm and thus linux-next. Could
> you try if it helps?

It happened only once on linux-next, so I can't verify the fix. But I
will watch out for other occurrences.

Dmitry Vyukov

unread,

Feb 6, 2017, 2:13:57 PM2/6/17

to Vlastimil Babka, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Mel Gorman, Michal Hocko, Andrew Morton

Unfortunately it does not seem to help.
Fuzzer now runs on 510948533b059f4f5033464f9f4a0c32d4ab0c08 of
mmotm/auto-latest
(git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git):

commit 510948533b059f4f5033464f9f4a0c32d4ab0c08
Date: Thu Feb 2 10:08:47 2017 +0100
mmotm: userfaultfd-non-cooperative-add-event-for-memory-unmaps-fix

The commit you referenced is already there:

commit 806b158031ca0b4714e775898396529a758ebc2c
Date: Thu Feb 2 08:53:16 2017 +0100
mm, page_alloc: use static global work_struct for draining per-cpu pages

But I still got:

[ INFO: possible circular locking dependency detected ]

4.9.0 #6 Not tainted
-------------------------------------------------------
syz-executor1/8199 is trying to acquire lock:
(cpu_hotplug.dep_map){++++++}, at: [<ffffffff81422fe7>]
get_online_cpus+0x37/0x90 kernel/cpu.c:246

but task is already holding lock:

(pcpu_alloc_mutex){+.+.+.}, at: [<ffffffff818f07ea>]
pcpu_alloc+0xbda/0x1280 mm/percpu.c:896

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

[ 403.953319] [<ffffffff8156fc29>] validate_chain
kernel/locking/lockdep.c:2265 [inline]
[ 403.953319] [<ffffffff8156fc29>]
__lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3338
[ 403.961232] [<ffffffff81571db1>] lock_acquire+0x2a1/0x630
kernel/locking/lockdep.c:3753
[ 403.968788] [<ffffffff8436697e>] __mutex_lock_common
kernel/locking/mutex.c:521 [inline]
[ 403.968788] [<ffffffff8436697e>]
mutex_lock_nested+0x24e/0xff0 kernel/locking/mutex.c:621
[ 403.976782] [<ffffffff818f07ea>] pcpu_alloc+0xbda/0x1280
mm/percpu.c:896
[ 403.984266] [<ffffffff818f0ee4>] __alloc_percpu+0x24/0x30
mm/percpu.c:1075
[ 403.991873] [<ffffffff816543e3>]
smpcfd_prepare_cpu+0x73/0xd0 kernel/smp.c:44
[ 403.999799] [<ffffffff814240b4>]
cpuhp_invoke_callback+0x254/0x1480 kernel/cpu.c:136
[ 404.008253] [<ffffffff81425821>]
cpuhp_up_callbacks+0x81/0x2a0 kernel/cpu.c:493
[ 404.016365] [<ffffffff81427bf3>] _cpu_up+0x1e3/0x2a0 kernel/cpu.c:1057
[ 404.023507] [<ffffffff81427d23>] do_cpu_up+0x73/0xa0 kernel/cpu.c:1087
[ 404.030647] [<ffffffff81427d68>] cpu_up+0x18/0x20 kernel/cpu.c:1095
[ 404.037523] [<ffffffff854ede84>] smp_init+0xe9/0xee kernel/smp.c:564
[ 404.044559] [<ffffffff85482f81>]
kernel_init_freeable+0x439/0x690 init/main.c:1010
[ 404.052811] [<ffffffff84357083>] kernel_init+0x13/0x180
init/main.c:941
[ 404.060198] [<ffffffff84377baa>] ret_from_fork+0x2a/0x40
arch/x86/entry/entry_64.S:433

[ 404.072827] [<ffffffff8156fc29>] validate_chain
kernel/locking/lockdep.c:2265 [inline]
[ 404.072827] [<ffffffff8156fc29>]
__lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3338
[ 404.080733] [<ffffffff81571db1>] lock_acquire+0x2a1/0x630
kernel/locking/lockdep.c:3753
[ 404.088311] [<ffffffff8436697e>] __mutex_lock_common
kernel/locking/mutex.c:521 [inline]
[ 404.088311] [<ffffffff8436697e>]
mutex_lock_nested+0x24e/0xff0 kernel/locking/mutex.c:621
[ 404.096318] [<ffffffff81427876>]
cpu_hotplug_begin+0x206/0x2e0 kernel/cpu.c:304
[ 404.104321] [<ffffffff81427ada>] _cpu_up+0xca/0x2a0 kernel/cpu.c:1011
[ 404.111357] [<ffffffff81427d23>] do_cpu_up+0x73/0xa0 kernel/cpu.c:1087
[ 404.118480] [<ffffffff81427d68>] cpu_up+0x18/0x20 kernel/cpu.c:1095
[ 404.125360] [<ffffffff854ede84>] smp_init+0xe9/0xee kernel/smp.c:564
[ 404.132393] [<ffffffff85482f81>]
kernel_init_freeable+0x439/0x690 init/main.c:1010
[ 404.140668] [<ffffffff84357083>] kernel_init+0x13/0x180
init/main.c:941
[ 404.148079] [<ffffffff84377baa>] ret_from_fork+0x2a/0x40
arch/x86/entry/entry_64.S:433

[ 404.160977] [<ffffffff8156976d>] check_prev_add
kernel/locking/lockdep.c:1828 [inline]
[ 404.160977] [<ffffffff8156976d>]
check_prevs_add+0xa8d/0x1c00 kernel/locking/lockdep.c:1938
[ 404.168898] [<ffffffff8156fc29>] validate_chain
kernel/locking/lockdep.c:2265 [inline]
[ 404.168898] [<ffffffff8156fc29>]
__lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3338
[ 404.176844] [<ffffffff81571db1>] lock_acquire+0x2a1/0x630
kernel/locking/lockdep.c:3753
[ 404.184416] [<ffffffff81423012>] get_online_cpus+0x62/0x90
kernel/cpu.c:248
[ 404.192103] [<ffffffff8185fcf8>] drain_all_pages+0xf8/0x710
mm/page_alloc.c:2385
[ 404.199880] [<ffffffff81865e5d>]
__alloc_pages_direct_reclaim mm/page_alloc.c:3440 [inline]
[ 404.199880] [<ffffffff81865e5d>]
__alloc_pages_slowpath+0x8fd/0x2370 mm/page_alloc.c:3778
[ 404.208406] [<ffffffff818681c5>]
__alloc_pages_nodemask+0x8f5/0xc60 mm/page_alloc.c:3980
[ 404.216851] [<ffffffff818ed0c1>] __alloc_pages
include/linux/gfp.h:426 [inline]
[ 404.216851] [<ffffffff818ed0c1>] __alloc_pages_node
include/linux/gfp.h:439 [inline]
[ 404.216851] [<ffffffff818ed0c1>] alloc_pages_node
include/linux/gfp.h:453 [inline]
[ 404.216851] [<ffffffff818ed0c1>] pcpu_alloc_pages
mm/percpu-vm.c:93 [inline]
[ 404.216851] [<ffffffff818ed0c1>]
pcpu_populate_chunk+0x1e1/0x900 mm/percpu-vm.c:282
[ 404.225015] [<ffffffff818f0a11>] pcpu_alloc+0xe01/0x1280
mm/percpu.c:998
[ 404.232482] [<ffffffff818f0eb7>]
__alloc_percpu_gfp+0x27/0x30 mm/percpu.c:1062
[ 404.240389] [<ffffffff817d25b2>] bpf_array_alloc_percpu
kernel/bpf/arraymap.c:34 [inline]
[ 404.240389] [<ffffffff817d25b2>] array_map_alloc+0x532/0x710
kernel/bpf/arraymap.c:99
[ 404.248224] [<ffffffff817ba034>] find_and_alloc_map
kernel/bpf/syscall.c:34 [inline]
[ 404.248224] [<ffffffff817ba034>] map_create
kernel/bpf/syscall.c:188 [inline]
[ 404.248224] [<ffffffff817ba034>] SYSC_bpf
kernel/bpf/syscall.c:870 [inline]
[ 404.248224] [<ffffffff817ba034>] SyS_bpf+0xd64/0x2500
kernel/bpf/syscall.c:827
[ 404.255434] [<ffffffff84377941>] entry_SYSCALL_64_fastpath+0x1f/0xc2

other info that might help us debug this:

Chain exists of:

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(pcpu_alloc_mutex);
lock(cpu_hotplug.lock);
lock(pcpu_alloc_mutex);
lock(cpu_hotplug.dep_map);

*** DEADLOCK ***

2 locks held by syz-executor1/8199:
#0: (pcpu_alloc_mutex){+.+.+.}, at: [<ffffffff818f07ea>]
pcpu_alloc+0xbda/0x1280 mm/percpu.c:896
#1: (pcpu_drain_mutex){+.+...}, at: [<ffffffff8185fcd7>]
drain_all_pages+0xd7/0x710 mm/page_alloc.c:2375

stack backtrace:
CPU: 0 PID: 8199 Comm: syz-executor1 Not tainted 4.9.0 #6

Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 01/01/2011

ffff88017ea4e118 ffffffff8234d0df ffffffff00000000 1ffff1002fd49bb6
ffffed002fd49bae 0000000041b58ab3 ffffffff84b38180 ffffffff8234cdf1
ffffffff84b00510 ffffffff81560170 ffff88018ab02200 0000000041b58ab3
Call Trace:
[<ffffffff8234d0df>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff8234d0df>] dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
[<ffffffff815673e7>] print_circular_bug+0x307/0x3b0
kernel/locking/lockdep.c:1202
[<ffffffff8156976d>] check_prev_add kernel/locking/lockdep.c:1828 [inline]
[<ffffffff8156976d>] check_prevs_add+0xa8d/0x1c00 kernel/locking/lockdep.c:1938
[<ffffffff8156fc29>] validate_chain kernel/locking/lockdep.c:2265 [inline]
[<ffffffff8156fc29>] __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3338
[<ffffffff81571db1>] lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753
[<ffffffff81423012>] get_online_cpus+0x62/0x90 kernel/cpu.c:248
[<ffffffff8185fcf8>] drain_all_pages+0xf8/0x710 mm/page_alloc.c:2385
[<ffffffff81865e5d>] __alloc_pages_direct_reclaim mm/page_alloc.c:3440 [inline]
[<ffffffff81865e5d>] __alloc_pages_slowpath+0x8fd/0x2370 mm/page_alloc.c:3778
[<ffffffff818681c5>] __alloc_pages_nodemask+0x8f5/0xc60 mm/page_alloc.c:3980
[<ffffffff818ed0c1>] __alloc_pages include/linux/gfp.h:426 [inline]
[<ffffffff818ed0c1>] __alloc_pages_node include/linux/gfp.h:439 [inline]
[<ffffffff818ed0c1>] alloc_pages_node include/linux/gfp.h:453 [inline]
[<ffffffff818ed0c1>] pcpu_alloc_pages mm/percpu-vm.c:93 [inline]
[<ffffffff818ed0c1>] pcpu_populate_chunk+0x1e1/0x900 mm/percpu-vm.c:282
[<ffffffff818f0a11>] pcpu_alloc+0xe01/0x1280 mm/percpu.c:998
[<ffffffff818f0eb7>] __alloc_percpu_gfp+0x27/0x30 mm/percpu.c:1062
[<ffffffff817d25b2>] bpf_array_alloc_percpu kernel/bpf/arraymap.c:34 [inline]
[<ffffffff817d25b2>] array_map_alloc+0x532/0x710 kernel/bpf/arraymap.c:99
[<ffffffff817ba034>] find_and_alloc_map kernel/bpf/syscall.c:34 [inline]
[<ffffffff817ba034>] map_create kernel/bpf/syscall.c:188 [inline]
[<ffffffff817ba034>] SYSC_bpf kernel/bpf/syscall.c:870 [inline]
[<ffffffff817ba034>] SyS_bpf+0xd64/0x2500 kernel/bpf/syscall.c:827
[<ffffffff84377941>] entry_SYSCALL_64_fastpath+0x1f/0xc2
syz-executor1: page allocation failure: order:0,
mode:0x14001c2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_COLD), nodemask=(null)
syz-executor1 cpuset=/ mems_allowed=0
CPU: 0 PID: 8199 Comm: syz-executor1 Not tainted 4.9.0 #6

Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 01/01/2011

ffff88017ea4eb80 ffffffff8234d0df ffffffff00000000 1ffff1002fd49d03
ffffed002fd49cfb 0000000041b58ab3 ffffffff84b38180 ffffffff8234cdf1
0000000000000282 ffffffff84fd53c0 ffff8801dae65b38 ffff88017ea4e7b8
Call Trace:
[<ffffffff8234d0df>] __dump_stack lib/dump_stack.c:15 [inline]
[<ffffffff8234d0df>] dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
[<ffffffff8186530f>] warn_alloc+0x21f/0x360 mm/page_alloc.c:3126
[<ffffffff818671f8>] __alloc_pages_slowpath+0x1c98/0x2370 mm/page_alloc.c:3890
[<ffffffff818681c5>] __alloc_pages_nodemask+0x8f5/0xc60 mm/page_alloc.c:3980
[<ffffffff818ed0c1>] __alloc_pages include/linux/gfp.h:426 [inline]
[<ffffffff818ed0c1>] __alloc_pages_node include/linux/gfp.h:439 [inline]
[<ffffffff818ed0c1>] alloc_pages_node include/linux/gfp.h:453 [inline]
[<ffffffff818ed0c1>] pcpu_alloc_pages mm/percpu-vm.c:93 [inline]
[<ffffffff818ed0c1>] pcpu_populate_chunk+0x1e1/0x900 mm/percpu-vm.c:282
[<ffffffff818f0a11>] pcpu_alloc+0xe01/0x1280 mm/percpu.c:998
[<ffffffff818f0eb7>] __alloc_percpu_gfp+0x27/0x30 mm/percpu.c:1062
[<ffffffff817d25b2>] bpf_array_alloc_percpu kernel/bpf/arraymap.c:34 [inline]
[<ffffffff817d25b2>] array_map_alloc+0x532/0x710 kernel/bpf/arraymap.c:99
[<ffffffff817ba034>] find_and_alloc_map kernel/bpf/syscall.c:34 [inline]
[<ffffffff817ba034>] map_create kernel/bpf/syscall.c:188 [inline]
[<ffffffff817ba034>] SYSC_bpf kernel/bpf/syscall.c:870 [inline]
[<ffffffff817ba034>] SyS_bpf+0xd64/0x2500 kernel/bpf/syscall.c:827
[<ffffffff84377941>] entry_SYSCALL_64_fastpath+0x1f/0xc2

Mel Gorman

unread,

Feb 6, 2017, 5:05:33 PM2/6/17

to Dmitry Vyukov, Vlastimil Babka, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Michal Hocko, Andrew Morton

On Mon, Feb 06, 2017 at 08:13:35PM +0100, Dmitry Vyukov wrote:
> On Mon, Jan 30, 2017 at 4:48 PM, Dmitry Vyukov <dvy...@google.com> wrote:
> > On Sun, Jan 29, 2017 at 6:22 PM, Vlastimil Babka <vba...@suse.cz> wrote:
> >> On 29.1.2017 13:44, Dmitry Vyukov wrote:
> >>> Hello,
> >>>
> >>> I've got the following deadlock report while running syzkaller fuzzer
> >>> on f37208bc3c9c2f811460ef264909dfbc7f605a60:
> >>>
> >>> [ INFO: possible circular locking dependency detected ]
> >>> 4.10.0-rc5-next-20170125 #1 Not tainted
> >>> -------------------------------------------------------
> >>> syz-executor3/14255 is trying to acquire lock:
> >>> (cpu_hotplug.dep_map){++++++}, at: [<ffffffff814271c7>]
> >>> get_online_cpus+0x37/0x90 kernel/cpu.c:239
> >>>
> >>> but task is already holding lock:
> >>> (pcpu_alloc_mutex){+.+.+.}, at: [<ffffffff81937fee>]
> >>> pcpu_alloc+0xbfe/0x1290 mm/percpu.c:897
> >>>
> >>> which lock already depends on the new lock.
> >>
> >> I suspect the dependency comes from recent changes in drain_all_pages(). They
> >> were later redone (for other reasons, but nice to have another validation) in
> >> the mmots patch [1], which AFAICS is not yet in mmotm and thus linux-next. Could
> >> you try if it helps?
> >
> > It happened only once on linux-next, so I can't verify the fix. But I
> > will watch out for other occurrences.
>
> Unfortunately it does not seem to help.

I'm a little stuck on how to best handle this. get_online_cpus() can
halt forever if the hotplug operation is holding the mutex when calling
pcpu_alloc. One option would be to add a try_get_online_cpus() helper which
trylocks the mutex. However, given that drain is so unlikely to actually
make that make a difference when racing against parallel allocations,
I think this should be acceptable.

Any objections?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3b93879990fd..a3192447e906 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3432,7 +3432,17 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
*/
if (!page && !drained) {
unreserve_highatomic_pageblock(ac, false);
- drain_all_pages(NULL);
+
+ /*
+ * Only drain from contexts allocating for user allocations.
+ * Kernel allocations could be holding a CPU hotplug-related
+ * mutex, particularly hot-add allocating per-cpu structures
+ * while hotplug-related mutex's are held which would prevent
+ * get_online_cpus ever returning.
+ */
+ if (gfp_mask & __GFP_HARDWALL)
+ drain_all_pages(NULL);
+
drained = true;
goto retry;
}

--
Mel Gorman
SUSE Labs

Michal Hocko

unread,

Feb 7, 2017, 3:43:05 AM2/7/17

to Dmitry Vyukov, Vlastimil Babka, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Mel Gorman, Andrew Morton

On Mon 06-02-17 20:13:35, Dmitry Vyukov wrote:
[...]

> Fuzzer now runs on 510948533b059f4f5033464f9f4a0c32d4ab0c08 of
> mmotm/auto-latest
> (git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git):
>
> commit 510948533b059f4f5033464f9f4a0c32d4ab0c08
> Date: Thu Feb 2 10:08:47 2017 +0100
> mmotm: userfaultfd-non-cooperative-add-event-for-memory-unmaps-fix
>
> The commit you referenced is already there:
>
> commit 806b158031ca0b4714e775898396529a758ebc2c
> Date: Thu Feb 2 08:53:16 2017 +0100
> mm, page_alloc: use static global work_struct for draining per-cpu pages
>
> But I still got:
>
> [ INFO: possible circular locking dependency detected ]
> 4.9.0 #6 Not tainted
> -------------------------------------------------------
> syz-executor1/8199 is trying to acquire lock:
> (cpu_hotplug.dep_map){++++++}, at: [<ffffffff81422fe7>] get_online_cpus+0x37/0x90 kernel/cpu.c:246
> but task is already holding lock:
> (pcpu_alloc_mutex){+.+.+.}, at: [<ffffffff818f07ea>] pcpu_alloc+0xbda/0x1280 mm/percpu.c:896
> which lock already depends on the new lock.
>

the original was too hard to read so here is the reformated output.

> the existing dependency chain (in reverse order) is:
>
> [ 403.953319] [<ffffffff8156fc29>] validate_chain kernel/locking/lockdep.c:2265 [inline]
> [ 403.953319] [<ffffffff8156fc29>] __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3338
> [ 403.961232] [<ffffffff81571db1>] lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753
> [ 403.968788] [<ffffffff8436697e>] __mutex_lock_common kernel/locking/mutex.c:521 [inline]
> [ 403.968788] [<ffffffff8436697e>] mutex_lock_nested+0x24e/0xff0 kernel/locking/mutex.c:621
> [ 403.976782] [<ffffffff818f07ea>] pcpu_alloc+0xbda/0x1280 mm/percpu.c:896
> [ 403.984266] [<ffffffff818f0ee4>] __alloc_percpu+0x24/0x30 mm/percpu.c:1075
> [ 403.991873] [<ffffffff816543e3>] smpcfd_prepare_cpu+0x73/0xd0 kernel/smp.c:44
> [ 403.999799] [<ffffffff814240b4>] cpuhp_invoke_callback+0x254/0x1480 kernel/cpu.c:136
> [ 404.008253] [<ffffffff81425821>] cpuhp_up_callbacks+0x81/0x2a0 kernel/cpu.c:493
> [ 404.016365] [<ffffffff81427bf3>] _cpu_up+0x1e3/0x2a0 kernel/cpu.c:1057
> [ 404.023507] [<ffffffff81427d23>] do_cpu_up+0x73/0xa0 kernel/cpu.c:1087
> [ 404.030647] [<ffffffff81427d68>] cpu_up+0x18/0x20 kernel/cpu.c:1095
> [ 404.037523] [<ffffffff854ede84>] smp_init+0xe9/0xee kernel/smp.c:564
> [ 404.044559] [<ffffffff85482f81>] kernel_init_freeable+0x439/0x690 init/main.c:1010
> [ 404.052811] [<ffffffff84357083>] kernel_init+0x13/0x180 init/main.c:941
> [ 404.060198] [<ffffffff84377baa>] ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433

cpu_hotplug_begin
cpu_hotplug.lock
pcpu_alloc
pcpu_alloc_mutex

> [ 404.072827] [<ffffffff8156fc29>] validate_chain kernel/locking/lockdep.c:2265 [inline]
> [ 404.072827] [<ffffffff8156fc29>] __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3338
> [ 404.080733] [<ffffffff81571db1>] lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753
> [ 404.088311] [<ffffffff8436697e>] __mutex_lock_common kernel/locking/mutex.c:521 [inline]
> [ 404.088311] [<ffffffff8436697e>] mutex_lock_nested+0x24e/0xff0 kernel/locking/mutex.c:621
> [ 404.096318] [<ffffffff81427876>] cpu_hotplug_begin+0x206/0x2e0 kernel/cpu.c:304
> [ 404.104321] [<ffffffff81427ada>] _cpu_up+0xca/0x2a0 kernel/cpu.c:1011
> [ 404.111357] [<ffffffff81427d23>] do_cpu_up+0x73/0xa0 kernel/cpu.c:1087
> [ 404.118480] [<ffffffff81427d68>] cpu_up+0x18/0x20 kernel/cpu.c:1095
> [ 404.125360] [<ffffffff854ede84>] smp_init+0xe9/0xee kernel/smp.c:564
> [ 404.132393] [<ffffffff85482f81>] kernel_init_freeable+0x439/0x690 init/main.c:1010
> [ 404.140668] [<ffffffff84357083>] kernel_init+0x13/0x180 init/main.c:941
> [ 404.148079] [<ffffffff84377baa>] ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433

cpu_hotplug_begin
cpu_hotplug.lock

> [ 404.160977] [<ffffffff8156976d>] check_prev_add kernel/locking/lockdep.c:1828 [inline]
> [ 404.160977] [<ffffffff8156976d>] check_prevs_add+0xa8d/0x1c00 kernel/locking/lockdep.c:1938
> [ 404.168898] [<ffffffff8156fc29>] validate_chain kernel/locking/lockdep.c:2265 [inline]
> [ 404.168898] [<ffffffff8156fc29>] __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3338
> [ 404.176844] [<ffffffff81571db1>] lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753
> [ 404.184416] [<ffffffff81423012>] get_online_cpus+0x62/0x90 kernel/cpu.c:248
> [ 404.192103] [<ffffffff8185fcf8>] drain_all_pages+0xf8/0x710 mm/page_alloc.c:2385
> [ 404.199880] [<ffffffff81865e5d>] __alloc_pages_direct_reclaim mm/page_alloc.c:3440 [inline]
> [ 404.199880] [<ffffffff81865e5d>] __alloc_pages_slowpath+0x8fd/0x2370 mm/page_alloc.c:3778
> [ 404.208406] [<ffffffff818681c5>] __alloc_pages_nodemask+0x8f5/0xc60 mm/page_alloc.c:3980
> [ 404.216851] [<ffffffff818ed0c1>] __alloc_pages include/linux/gfp.h:426 [inline]
> [ 404.216851] [<ffffffff818ed0c1>] __alloc_pages_node include/linux/gfp.h:439 [inline]
> [ 404.216851] [<ffffffff818ed0c1>] alloc_pages_node include/linux/gfp.h:453 [inline]
> [ 404.216851] [<ffffffff818ed0c1>] pcpu_alloc_pages mm/percpu-vm.c:93 [inline]
> [ 404.216851] [<ffffffff818ed0c1>] pcpu_populate_chunk+0x1e1/0x900 mm/percpu-vm.c:282
> [ 404.225015] [<ffffffff818f0a11>] pcpu_alloc+0xe01/0x1280 mm/percpu.c:998
> [ 404.232482] [<ffffffff818f0eb7>] __alloc_percpu_gfp+0x27/0x30 mm/percpu.c:1062
> [ 404.240389] [<ffffffff817d25b2>] bpf_array_alloc_percpu kernel/bpf/arraymap.c:34 [inline]
> [ 404.240389] [<ffffffff817d25b2>] array_map_alloc+0x532/0x710 kernel/bpf/arraymap.c:99
> [ 404.248224] [<ffffffff817ba034>] find_and_alloc_map kernel/bpf/syscall.c:34 [inline]
> [ 404.248224] [<ffffffff817ba034>] map_create kernel/bpf/syscall.c:188 [inline]
> [ 404.248224] [<ffffffff817ba034>] SYSC_bpf kernel/bpf/syscall.c:870 [inline]
> [ 404.248224] [<ffffffff817ba034>] SyS_bpf+0xd64/0x2500 kernel/bpf/syscall.c:827
> [ 404.255434] [<ffffffff84377941>] entry_SYSCALL_64_fastpath+0x1f/0xc2

pcpu_alloc
pcpu_alloc_mutex
drain_all_pages
pcpu_drain_mutex
get_online_cpus
cpu_hotplug.lock

so the deadlock is real!
--
Michal Hocko
SUSE Labs

Michal Hocko

unread,

Feb 7, 2017, 3:48:58 AM2/7/17

to Mel Gorman, Dmitry Vyukov, Vlastimil Babka, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

This wouldn't work AFAICS. If you look at the lockdep splat, the path
which reverses the locking order (takes pcpu_alloc_mutex prior to
cpu_hotplug.lock is bpf_array_alloc_percpu which is GFP_USER and thus
__GFP_HARDWALL.

I believe we shouldn't pull any dependency on the hotplug locks inside
the allocator. This is just too fragile! Can we simply drop the
get_online_cpus()? Why do we need it, anyway? Say we are racing with the
cpu offlining. I have to check the code but my impression was that WQ
code will ignore the cpu requested by the work item when the cpu is
going offline. If the offline happens while the worker function already
executes then it has to wait as we run with preemption disabled so we
should be safe here. Or am I missing something obvious?

Vlastimil Babka

unread,

Feb 7, 2017, 4:23:52 AM2/7/17

to Michal Hocko, Mel Gorman, Dmitry Vyukov, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

It was added after I noticed in review that queue_work_on() has a
comment that caller must ensure that cpu can't go away, and wondered
about it. Also noted that a similar lru_add_drain_all() does it too.

> cpu offlining. I have to check the code but my impression was that WQ
> code will ignore the cpu requested by the work item when the cpu is
> going offline. If the offline happens while the worker function already
> executes then it has to wait as we run with preemption disabled so we
> should be safe here. Or am I missing something obvious?

Tejun suggested an alternative solution to avoiding get_online_cpus() in
this thread:
https://lkml.kernel.org/r/<2017012317...@htj.duckdns.org>

Mel Gorman

unread,

Feb 7, 2017, 4:43:02 AM2/7/17

to Michal Hocko, Dmitry Vyukov, Vlastimil Babka, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

On Tue, Feb 07, 2017 at 09:48:56AM +0100, Michal Hocko wrote:
> > +
> > + /*
> > + * Only drain from contexts allocating for user allocations.
> > + * Kernel allocations could be holding a CPU hotplug-related
> > + * mutex, particularly hot-add allocating per-cpu structures
> > + * while hotplug-related mutex's are held which would prevent
> > + * get_online_cpus ever returning.
> > + */
> > + if (gfp_mask & __GFP_HARDWALL)
> > + drain_all_pages(NULL);
> > +
>
> This wouldn't work AFAICS. If you look at the lockdep splat, the path
> which reverses the locking order (takes pcpu_alloc_mutex prior to
> cpu_hotplug.lock is bpf_array_alloc_percpu which is GFP_USER and thus
> __GFP_HARDWALL.
>

You're right, I looked at the wrong caller.

> I believe we shouldn't pull any dependency on the hotplug locks inside
> the allocator. This is just too fragile! Can we simply drop the
> get_online_cpus()? Why do we need it, anyway?

To stop the CPU being queued going offline. Another alternative is bringing
back try_get_offline_cpus which Peter pointed out to be offline was removed
in commit 02ef3c4a2aae65a1632b27770bfea3f83ca06772 although it'd be a
shame for just this case.

> Say we are racing with the
> cpu offlining. I have to check the code but my impression was that WQ
> code will ignore the cpu requested by the work item when the cpu is
> going offline. If the offline happens while the worker function already
> executes then it has to wait as we run with preemption disabled so we
> should be safe here. Or am I missing something obvious?

It may be safe but I'd prefer to get confirmation from Tejun. If it happens
that a drain on a particular CPU gets missed in this path, it's not the
end of the world as the CPU offlining drains the pages in another path so
nothing gets lost.

It would also be acceptable if one CPU was drained twice because the
workqueue was unbound from a CPU being hot-removed and moved during a
hotplug operation.

If I'm reading this right, a hot-remove will set the pool POOL_DISASSOCIATED
and unbound. A workqueue queued for draining get migrated during hot-remove
and a drain operation will execute twice on a CPU -- one for what was
queued and a second time for the CPU it was migrated from. It should still
work with flush_work which doesn't appear to block forever if an item
got migrated to another workqueue. The actual drain workqueue function is
using the CPU ID it's currently running on so it shouldn't get confused.

Tejun, did I miss anything? Does a workqueue item queued on a CPU being
offline get unbound and a caller can still flush it safely? In this
specific case, it's ok that the workqueue item does not run on the CPU it
was queued on.

Mel Gorman

unread,

Feb 7, 2017, 4:46:16 AM2/7/17

to Vlastimil Babka, Michal Hocko, Dmitry Vyukov, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

On Tue, Feb 07, 2017 at 10:23:31AM +0100, Vlastimil Babka wrote:
>
> > cpu offlining. I have to check the code but my impression was that WQ
> > code will ignore the cpu requested by the work item when the cpu is
> > going offline. If the offline happens while the worker function already
> > executes then it has to wait as we run with preemption disabled so we
> > should be safe here. Or am I missing something obvious?
>
> Tejun suggested an alternative solution to avoiding get_online_cpus() in
> this thread:
> https://lkml.kernel.org/r/<2017012317...@htj.duckdns.org>

I was hoping it would not really be necessary to sync against the cpu
offline callback if we know we don't really have to guarantee we run on
the correct CPU.

Vlastimil Babka

unread,

Feb 7, 2017, 4:49:50 AM2/7/17

to Mel Gorman, Michal Hocko, Dmitry Vyukov, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

On 02/07/2017 10:43 AM, Mel Gorman wrote:
> If I'm reading this right, a hot-remove will set the pool POOL_DISASSOCIATED
> and unbound. A workqueue queued for draining get migrated during hot-remove
> and a drain operation will execute twice on a CPU -- one for what was
> queued and a second time for the CPU it was migrated from. It should still
> work with flush_work which doesn't appear to block forever if an item
> got migrated to another workqueue. The actual drain workqueue function is
> using the CPU ID it's currently running on so it shouldn't get confused.

Is the worker that will process this migrated workqueue also guaranteed
to be pinned to a cpu for the whole work, though? drain_local_pages()
needs that guarantee.

Michal Hocko

unread,

Feb 7, 2017, 4:53:22 AM2/7/17

to Vlastimil Babka, Mel Gorman, Dmitry Vyukov, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

Ohh, I haven't noticed the comment. Thanks for pointing it out. I still
do not see what would a missing get_online_cpus mean for queuing.

> Also noted that a similar lru_add_drain_all() does it too.
>
> > cpu offlining. I have to check the code but my impression was that WQ
> > code will ignore the cpu requested by the work item when the cpu is
> > going offline. If the offline happens while the worker function already
> > executes then it has to wait as we run with preemption disabled so we
> > should be safe here. Or am I missing something obvious?
>
> Tejun suggested an alternative solution to avoiding get_online_cpus() in
> this thread:
> https://lkml.kernel.org/r/<2017012317...@htj.duckdns.org>

OK, so we have page_alloc_cpu_notify which also does drain_pages so all
we have to do to make sure they do not race is to synchronize there.

Michal Hocko

unread,

Feb 7, 2017, 5:05:06 AM2/7/17

to Vlastimil Babka, Mel Gorman, Dmitry Vyukov, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

On Tue 07-02-17 10:49:28, Vlastimil Babka wrote:
> On 02/07/2017 10:43 AM, Mel Gorman wrote:
> > If I'm reading this right, a hot-remove will set the pool POOL_DISASSOCIATED
> > and unbound. A workqueue queued for draining get migrated during hot-remove
> > and a drain operation will execute twice on a CPU -- one for what was
> > queued and a second time for the CPU it was migrated from. It should still
> > work with flush_work which doesn't appear to block forever if an item
> > got migrated to another workqueue. The actual drain workqueue function is
> > using the CPU ID it's currently running on so it shouldn't get confused.
>
> Is the worker that will process this migrated workqueue also guaranteed
> to be pinned to a cpu for the whole work, though? drain_local_pages()
> needs that guarantee.

Yeah I guess you are right. This would mean that drain_local_pages_wq
should to preempt_{disable,enable} around drain_local_pages

>
> > Tejun, did I miss anything? Does a workqueue item queued on a CPU being
> > offline get unbound and a caller can still flush it safely? In this
> > specific case, it's ok that the workqueue item does not run on the CPU it
> > was queued on.

I guess we need to do one more step and ensure that our (rebound) worker
doesn't race with the page_alloc_cpu_notify. I guess we can just cmpxchg
pcp->count in drain_pages_zone to ensure the exclusivity. Not as simple
as I originally thought but doable I guess and definitely better than
making a subtle dependency on the hotplug locks which is just a PITA to
maintain.

Mel Gorman

unread,

Feb 7, 2017, 5:28:11 AM2/7/17

to Vlastimil Babka, Michal Hocko, Dmitry Vyukov, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

On Tue, Feb 07, 2017 at 10:49:28AM +0100, Vlastimil Babka wrote:
> On 02/07/2017 10:43 AM, Mel Gorman wrote:
> > If I'm reading this right, a hot-remove will set the pool POOL_DISASSOCIATED
> > and unbound. A workqueue queued for draining get migrated during hot-remove
> > and a drain operation will execute twice on a CPU -- one for what was
> > queued and a second time for the CPU it was migrated from. It should still
> > work with flush_work which doesn't appear to block forever if an item
> > got migrated to another workqueue. The actual drain workqueue function is
> > using the CPU ID it's currently running on so it shouldn't get confused.
>
> Is the worker that will process this migrated workqueue also guaranteed
> to be pinned to a cpu for the whole work, though? drain_local_pages()
> needs that guarantee.
>

It should be by running on a workqueue handler bound to that CPU (queued
on wq->cpu_pwqs in __queue_work)

Michal Hocko

unread,

Feb 7, 2017, 5:35:54 AM2/7/17

to Mel Gorman, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

Are you sure? The comment in kernel/workqueue.c says
* While DISASSOCIATED, the cpu may be offline and all workers have
* %WORKER_UNBOUND set and concurrency management disabled, and may
* be executing on any CPU. The pool behaves as an unbound one.

I might be misreadig but an unbound pool can be handled by workers which
are not pinned on any cpu AFAIU.

Mel Gorman

unread,

Feb 7, 2017, 5:42:51 AM2/7/17

to Vlastimil Babka, Michal Hocko, Dmitry Vyukov, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

On Tue, Feb 07, 2017 at 10:23:31AM +0100, Vlastimil Babka wrote:

> > cpu offlining. I have to check the code but my impression was that WQ
> > code will ignore the cpu requested by the work item when the cpu is
> > going offline. If the offline happens while the worker function already
> > executes then it has to wait as we run with preemption disabled so we
> > should be safe here. Or am I missing something obvious?
>
> Tejun suggested an alternative solution to avoiding get_online_cpus() in
> this thread:
> https://lkml.kernel.org/r/<2017012317...@htj.duckdns.org>

But it would look like the following as it could be serialised against
pcpu_drain_mutex as the cpu hotplug teardown callback is allowed to sleep.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3b93879990fd..8cd8b1bbe00c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2319,9 +2319,17 @@ static void drain_pages(unsigned int cpu)
{
struct zone *zone;

+ /*
+ * A per-cpu drain via a workqueue from drain_all_pages can be
+ * rescheduled onto an unrelated CPU. That allows the hotplug
+ * operation and the drain to potentially race on the same
+ * CPU. Serialise hotplug versus drain using pcpu_drain_mutex
+ */
+ mutex_lock(&pcpu_drain_mutex);
for_each_populated_zone(zone) {
drain_pages_zone(cpu, zone);
}
+ mutex_unlock(&pcpu_drain_mutex);
}

/*
@@ -2377,13 +2385,10 @@ void drain_all_pages(struct zone *zone)
mutex_lock(&pcpu_drain_mutex);
}

- get_online_cpus();
-
/*
- * We don't care about racing with CPU hotplug event
- * as offline notification will cause the notified
- * cpu to drain that CPU pcps and on_each_cpu_mask
- * disables preemption as part of its processing
+ * We don't care about racing with CPU hotplug event as offline
+ * notification will cause the notified cpu to drain that CPU pcps
+ * and it is serialised against here via pcpu_drain_mutex.
*/
for_each_online_cpu(cpu) {
struct per_cpu_pageset *pcp;
@@ -2418,7 +2423,6 @@ void drain_all_pages(struct zone *zone)
for_each_cpu(cpu, &cpus_with_pcps)
flush_work(per_cpu_ptr(&pcpu_drain, cpu));

- put_online_cpus();
mutex_unlock(&pcpu_drain_mutex);

Mel Gorman

unread,

Feb 7, 2017, 6:13:56 AM2/7/17

to Vlastimil Babka, Michal Hocko, Dmitry Vyukov, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

On Tue, Feb 07, 2017 at 10:42:49AM +0000, Mel Gorman wrote:
> On Tue, Feb 07, 2017 at 10:23:31AM +0100, Vlastimil Babka wrote:
> > > cpu offlining. I have to check the code but my impression was that WQ
> > > code will ignore the cpu requested by the work item when the cpu is
> > > going offline. If the offline happens while the worker function already
> > > executes then it has to wait as we run with preemption disabled so we
> > > should be safe here. Or am I missing something obvious?
> >
> > Tejun suggested an alternative solution to avoiding get_online_cpus() in
> > this thread:
> > https://lkml.kernel.org/r/<2017012317...@htj.duckdns.org>
>
> But it would look like the following as it could be serialised against
> pcpu_drain_mutex as the cpu hotplug teardown callback is allowed to sleep.
>

Bah, this is obviously unsafe. It's guaranteed to deadlock.

Tetsuo Handa

unread,

Feb 7, 2017, 6:24:29 AM2/7/17

to Mel Gorman, Dmitry Vyukov, Vlastimil Babka, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Michal Hocko, Andrew Morton

Why below change on top of current linux.git (8b1b41ee74f9) insufficient?
I think it can eliminate IPIs a lot without introducing lockdep warnings.

----------
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f3e0c69..ae6e7aa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2354,6 +2354,7 @@ void drain_local_pages(struct zone *zone)
*/
void drain_all_pages(struct zone *zone)
{
+ static DEFINE_MUTEX(lock);
int cpu;

/*
@@ -2362,6 +2363,7 @@ void drain_all_pages(struct zone *zone)
*/
static cpumask_t cpus_with_pcps;

+ mutex_lock(&lock);
/*

* We don't care about racing with CPU hotplug event

* as offline notification will cause the notified

@@ -2394,6 +2396,7 @@ void drain_all_pages(struct zone *zone)
}
on_each_cpu_mask(&cpus_with_pcps, (smp_call_func_t) drain_local_pages,
zone, 1);
+ mutex_unlock(&lock);
}

#ifdef CONFIG_HIBERNATION
----------

By the way, drain_all_pages() is a sleepable context, isn't it?
I don't get get soft lockup using current linux.git (8b1b41ee74f9).
But I trivially get soft lockup if I try above change after reverting
"mm, page_alloc: use static global work_struct for draining per-cpu pages" and
"mm, page_alloc: drain per-cpu pages from workqueue context" on linux-next-20170207 .
List corruption also came in with linux-next-20170207 .

----------
[ 32.672890] ip6_tables: (C) 2000-2006 Netfilter Core Team
[ 33.109860] Ebtables v2.0 registered
[ 33.410293] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
[ 33.935512] IPv6: ADDRCONF(NETDEV_UP): eno16777728: link is not ready
[ 33.937478] e1000: eno16777728 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[ 33.939777] IPv6: ADDRCONF(NETDEV_CHANGE): eno16777728: link becomes ready
[ 34.194000] Netfilter messages via NETLINK v0.30.
[ 34.258828] ip_set: protocol 6
[ 38.518375] nf_conntrack: default automatic helper assignment has been turned off for security reasons and CT-based firewall rule not found. Use the iptables CT target to attach helpers instead.
[ 43.911609] cp (5167) used greatest stack depth: 10488 bytes left
[ 48.174125] cp (5860) used greatest stack depth: 10224 bytes left
[ 101.075280] ------------[ cut here ]------------
[ 101.076743] WARNING: CPU: 0 PID: 9766 at lib/list_debug.c:25 __list_add_valid+0x46/0xa0
[ 101.079095] list_add corruption. next->prev should be prev (ffffea00013dfce0), but was ffff88006d3e1d40. (next=ffff88006d3e1d40).
[ 101.081863] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd ppdev glue_helper vmw_balloon pcspkr sg parport_pc parport i2c_piix4 shpchp vmw_vsock_vmci_transport vsock vmw_vmci ip_tables xfs libcrc32c sd_mod sr_mod cdrom ata_generic pata_acpi crc32c_intel vmwgfx serio_raw drm_kms_helper syscopyarea sysfillrect
[ 101.098346] sysimgblt fb_sys_fops ttm mptspi scsi_transport_spi ata_piix mptscsih ahci drm libahci mptbase libata e1000 i2c_core
[ 101.101279] CPU: 0 PID: 9766 Comm: oom-write Tainted: G W 4.10.0-rc7-next-20170207+ #500
[ 101.103519] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 101.106031] Call Trace:
[ 101.107069] dump_stack+0x85/0xc9
[ 101.108239] __warn+0xd1/0xf0
[ 101.109566] warn_slowpath_fmt+0x5f/0x80
[ 101.110843] __list_add_valid+0x46/0xa0
[ 101.112187] free_hot_cold_page+0x205/0x460
[ 101.113593] free_hot_cold_page_list+0x3c/0x1c0
[ 101.115046] shrink_page_list+0x4dd/0xd10
[ 101.116388] shrink_inactive_list+0x1c5/0x660
[ 101.117796] shrink_node_memcg+0x535/0x7f0
[ 101.119158] ? mem_cgroup_iter+0x1e0/0x720
[ 101.120873] shrink_node+0xe1/0x310
[ 101.122250] do_try_to_free_pages+0xe1/0x300
[ 101.123611] try_to_free_pages+0x131/0x3f0
[ 101.124998] __alloc_pages_slowpath+0x479/0xe32
[ 101.126421] __alloc_pages_nodemask+0x382/0x3d0
[ 101.128053] alloc_pages_vma+0xae/0x2f0
[ 101.129353] do_anonymous_page+0x111/0x5d0
[ 101.130725] __handle_mm_fault+0xbc9/0xeb0
[ 101.132127] ? sched_clock+0x9/0x10
[ 101.133389] ? sched_clock_cpu+0x11/0xc0
[ 101.134764] handle_mm_fault+0x16b/0x390
[ 101.136147] ? handle_mm_fault+0x49/0x390
[ 101.137545] __do_page_fault+0x24a/0x530
[ 101.138999] do_page_fault+0x30/0x80
[ 101.140375] page_fault+0x28/0x30
[ 101.141685] RIP: 0033:0x4006a0
[ 101.142880] RSP: 002b:00007ffe3a4acc10 EFLAGS: 00010206
[ 101.144638] RAX: 0000000037f8e000 RBX: 0000000080000000 RCX: 00007fc5a1fda650
[ 101.146653] RDX: 0000000000000000 RSI: 00007ffe3a4aca30 RDI: 00007ffe3a4aca30
[ 101.148768] RBP: 00007fc4a2111010 R08: 00007ffe3a4acb40 R09: 00007ffe3a4ac980
[ 101.150893] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000007
[ 101.152937] R13: 00007fc4a2111010 R14: 0000000000000000 R15: 0000000000000000
[ 101.155240] ---[ end trace 5d8b63572ab78be3 ]---
[ 128.945470] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [oom-write:9766]
[ 128.947878] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd ppdev glue_helper vmw_balloon pcspkr sg parport_pc parport i2c_piix4 shpchp vmw_vsock_vmci_transport vsock vmw_vmci ip_tables xfs libcrc32c sd_mod sr_mod cdrom ata_generic pata_acpi crc32c_intel vmwgfx serio_raw drm_kms_helper syscopyarea sysfillrect
[ 128.966961] sysimgblt fb_sys_fops ttm mptspi scsi_transport_spi ata_piix mptscsih ahci drm libahci mptbase libata e1000 i2c_core
[ 128.970135] irq event stamp: 2491700
[ 128.971666] hardirqs last enabled at (2491699): [<ffffffff817e5d70>] restore_regs_and_iret+0x0/0x1d
[ 128.974560] hardirqs last disabled at (2491700): [<ffffffff817e7198>] apic_timer_interrupt+0x98/0xb0
[ 128.977148] softirqs last enabled at (2445236): [<ffffffff817eab39>] __do_softirq+0x349/0x52d
[ 128.979639] softirqs last disabled at (2445215): [<ffffffff810a99e5>] irq_exit+0xf5/0x110
[ 128.982019] CPU: 2 PID: 9766 Comm: oom-write Tainted: G W 4.10.0-rc7-next-20170207+ #500
[ 128.984598] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 128.987490] task: ffff880067ed8040 task.stack: ffffc9001117c000
[ 128.989464] RIP: 0010:smp_call_function_many+0x25c/0x320
[ 128.991305] RSP: 0000:ffffc9001117fad0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff10
[ 128.993609] RAX: 0000000000000000 RBX: ffff88006d7dd640 RCX: 0000000000000001
[ 128.995823] RDX: 0000000000000001 RSI: ffff88006d3e3398 RDI: ffff88006c528dc8
[ 128.998058] RBP: ffffc9001117fb18 R08: 0000000000000009 R09: 0000000000000000
[ 129.000284] R10: 0000000000000001 R11: ffff88005486d4a8 R12: 0000000000000000
[ 129.002521] R13: ffffffff811fe6e0 R14: 0000000000000000 R15: 0000000000000080
[ 129.004788] FS: 00007fc5a24ec740(0000) GS:ffff88006d600000(0000) knlGS:0000000000000000
[ 129.007235] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 129.009211] CR2: 00007fc4da0c5010 CR3: 0000000044314000 CR4: 00000000001406e0
[ 129.011641] Call Trace:
[ 129.012993] ? page_alloc_cpu_dead+0x30/0x30
[ 129.014701] on_each_cpu_mask+0x30/0xb0
[ 129.016318] drain_all_pages+0x113/0x170
[ 129.017949] __alloc_pages_slowpath+0x520/0xe32
[ 129.019710] __alloc_pages_nodemask+0x382/0x3d0
[ 129.021472] alloc_pages_vma+0xae/0x2f0
[ 129.023080] do_anonymous_page+0x111/0x5d0
[ 129.024760] __handle_mm_fault+0xbc9/0xeb0
[ 129.026449] ? sched_clock+0x9/0x10
[ 129.028042] ? sched_clock_cpu+0x11/0xc0
[ 129.029729] handle_mm_fault+0x16b/0x390
[ 129.031418] ? handle_mm_fault+0x49/0x390
[ 129.033077] __do_page_fault+0x24a/0x530
[ 129.034721] do_page_fault+0x30/0x80
[ 129.036279] page_fault+0x28/0x30
[ 129.038160] RIP: 0033:0x4006a0
[ 129.039600] RSP: 002b:00007ffe3a4acc10 EFLAGS: 00010206
[ 129.041488] RAX: 0000000037fb4000 RBX: 0000000080000000 RCX: 00007fc5a1fda650
[ 129.043758] RDX: 0000000000000000 RSI: 00007ffe3a4aca30 RDI: 00007ffe3a4aca30
[ 129.046037] RBP: 00007fc4a2111010 R08: 00007ffe3a4acb40 R09: 00007ffe3a4ac980
[ 129.048324] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000007
[ 129.050587] R13: 00007fc4a2111010 R14: 0000000000000000 R15: 0000000000000000
[ 129.052829] Code: 7b 3e 2b 00 3b 05 69 b6 d5 00 41 89 c4 0f 8d 3f fe ff ff 48 63 d0 48 8b 33 48 03 34 d5 60 c4 ab 81 8b 56 18 83 e2 01 74 0a f3 90 <8b> 4e 18 83 e1 01 75 f6 83 f8 ff 48 8b 7b 08 74 2a 48 63 35 30
[ 156.945366] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [oom-write:9766]
[ 156.947797] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd ppdev glue_helper vmw_balloon pcspkr sg parport_pc parport i2c_piix4 shpchp vmw_vsock_vmci_transport vsock vmw_vmci ip_tables xfs libcrc32c sd_mod sr_mod cdrom ata_generic pata_acpi crc32c_intel vmwgfx serio_raw drm_kms_helper syscopyarea sysfillrect
[ 156.968021] sysimgblt fb_sys_fops ttm mptspi scsi_transport_spi ata_piix mptscsih ahci drm libahci mptbase libata e1000 i2c_core
[ 156.971269] irq event stamp: 2547400
[ 156.972865] hardirqs last enabled at (2547399): [<ffffffff817e5d70>] restore_regs_and_iret+0x0/0x1d
[ 156.975579] hardirqs last disabled at (2547400): [<ffffffff817e7198>] apic_timer_interrupt+0x98/0xb0
[ 156.978287] softirqs last enabled at (2445236): [<ffffffff817eab39>] __do_softirq+0x349/0x52d
[ 156.980863] softirqs last disabled at (2445215): [<ffffffff810a99e5>] irq_exit+0xf5/0x110
[ 156.983393] CPU: 2 PID: 9766 Comm: oom-write Tainted: G W L 4.10.0-rc7-next-20170207+ #500
[ 156.986111] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 156.989117] task: ffff880067ed8040 task.stack: ffffc9001117c000
[ 156.991179] RIP: 0010:smp_call_function_many+0x25c/0x320
[ 156.993113] RSP: 0000:ffffc9001117fad0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff10
[ 156.995501] RAX: 0000000000000000 RBX: ffff88006d7dd640 RCX: 0000000000000001
[ 156.997803] RDX: 0000000000000001 RSI: ffff88006d3e3398 RDI: ffff88006c528dc8
[ 157.000111] RBP: ffffc9001117fb18 R08: 0000000000000009 R09: 0000000000000000
[ 157.002402] R10: 0000000000000001 R11: ffff88005486d4a8 R12: 0000000000000000
[ 157.004700] R13: ffffffff811fe6e0 R14: 0000000000000000 R15: 0000000000000080
[ 157.006980] FS: 00007fc5a24ec740(0000) GS:ffff88006d600000(0000) knlGS:0000000000000000
[ 157.009477] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 157.011494] CR2: 00007fc4da0c5010 CR3: 0000000044314000 CR4: 00000000001406e0
[ 157.013844] Call Trace:
[ 157.015231] ? page_alloc_cpu_dead+0x30/0x30
[ 157.016975] on_each_cpu_mask+0x30/0xb0
[ 157.018723] drain_all_pages+0x113/0x170
[ 157.020410] __alloc_pages_slowpath+0x520/0xe32
[ 157.022219] __alloc_pages_nodemask+0x382/0x3d0
[ 157.024036] alloc_pages_vma+0xae/0x2f0
[ 157.025704] do_anonymous_page+0x111/0x5d0
[ 157.027430] __handle_mm_fault+0xbc9/0xeb0
[ 157.029143] ? sched_clock+0x9/0x10
[ 157.030739] ? sched_clock_cpu+0x11/0xc0
[ 157.032416] handle_mm_fault+0x16b/0x390
[ 157.034100] ? handle_mm_fault+0x49/0x390
[ 157.035843] __do_page_fault+0x24a/0x530
[ 157.037537] do_page_fault+0x30/0x80
[ 157.039160] page_fault+0x28/0x30
[ 157.040792] RIP: 0033:0x4006a0
[ 157.042290] RSP: 002b:00007ffe3a4acc10 EFLAGS: 00010206
[ 157.044231] RAX: 0000000037fb4000 RBX: 0000000080000000 RCX: 00007fc5a1fda650
[ 157.046635] RDX: 0000000000000000 RSI: 00007ffe3a4aca30 RDI: 00007ffe3a4aca30
[ 157.048941] RBP: 00007fc4a2111010 R08: 00007ffe3a4acb40 R09: 00007ffe3a4ac980
[ 157.051324] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000007
[ 157.053577] R13: 00007fc4a2111010 R14: 0000000000000000 R15: 0000000000000000
[ 157.055822] Code: 7b 3e 2b 00 3b 05 69 b6 d5 00 41 89 c4 0f 8d 3f fe ff ff 48 63 d0 48 8b 33 48 03 34 d5 60 c4 ab 81 8b 56 18 83 e2 01 74 0a f3 90 <8b> 4e 18 83 e1 01 75 f6 83 f8 ff 48 8b 7b 08 74 2a 48 63 35 30
[ 171.241423] BUG: spinlock lockup suspected on CPU#3, swapper/3/0
[ 171.243575] lock: 0xffff88007ffddd00, .magic: dead4ead, .owner: kworker/0:3/9777, .owner_cpu: 0
[ 171.246136] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G W L 4.10.0-rc7-next-20170207+ #500
[ 171.248706] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 171.251617] Call Trace:
[ 171.252854] <IRQ>
[ 171.253989] dump_stack+0x85/0xc9
[ 171.255376] spin_dump+0x90/0x95
[ 171.256741] do_raw_spin_lock+0x9a/0x130
[ 171.258246] _raw_spin_lock_irqsave+0x75/0x90
[ 171.259822] ? free_pcppages_bulk+0x37/0x910
[ 171.261359] free_pcppages_bulk+0x37/0x910
[ 171.262843] ? sched_clock_cpu+0x11/0xc0
[ 171.264289] ? sched_clock_tick+0x2d/0x80
[ 171.265747] drain_pages_zone+0x82/0x90
[ 171.267154] ? page_alloc_cpu_dead+0x30/0x30
[ 171.268644] drain_pages+0x3f/0x60
[ 171.269955] drain_local_pages+0x25/0x30
[ 171.271366] flush_smp_call_function_queue+0x7b/0x170
[ 171.273014] generic_smp_call_function_single_interrupt+0x13/0x30
[ 171.274878] smp_call_function_interrupt+0x27/0x40
[ 171.276460] call_function_interrupt+0x9d/0xb0
[ 171.277963] RIP: 0010:native_safe_halt+0x6/0x10
[ 171.279467] RSP: 0018:ffffc900003a3e70 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff03
[ 171.281610] RAX: ffff88006c5c0040 RBX: 0000000000000000 RCX: 0000000000000000
[ 171.283662] RDX: ffff88006c5c0040 RSI: 0000000000000001 RDI: ffff88006c5c0040
[ 171.285727] RBP: ffffc900003a3e70 R08: 0000000000000000 R09: 0000000000000000
[ 171.287778] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000003
[ 171.289831] R13: ffff88006c5c0040 R14: ffff88006c5c0040 R15: 0000000000000000
[ 171.291882] </IRQ>
[ 171.292934] default_idle+0x23/0x1d0
[ 171.294287] arch_cpu_idle+0xf/0x20
[ 171.295618] default_idle_call+0x23/0x40
[ 171.297040] do_idle+0x162/0x230
[ 171.298308] cpu_startup_entry+0x71/0x80
[ 171.299732] start_secondary+0x17f/0x1f0
[ 171.301146] start_cpu+0x14/0x14
[ 171.302432] NMI backtrace for cpu 3
[ 171.303764] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G W L 4.10.0-rc7-next-20170207+ #500
[ 171.306218] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 171.309065] Call Trace:
[ 171.310500] <IRQ>
[ 171.311569] dump_stack+0x85/0xc9
[ 171.312946] nmi_cpu_backtrace+0xc0/0xe0
[ 171.314396] ? irq_force_complete_move+0x170/0x170
[ 171.316012] nmi_trigger_cpumask_backtrace+0x12a/0x188
[ 171.317679] arch_trigger_cpumask_backtrace+0x19/0x20
[ 171.319294] do_raw_spin_lock+0xa8/0x130
[ 171.320686] _raw_spin_lock_irqsave+0x75/0x90
[ 171.322148] ? free_pcppages_bulk+0x37/0x910
[ 171.323585] free_pcppages_bulk+0x37/0x910
[ 171.325037] ? sched_clock_cpu+0x11/0xc0
[ 171.326615] ? sched_clock_tick+0x2d/0x80
[ 171.327974] drain_pages_zone+0x82/0x90
[ 171.329294] ? page_alloc_cpu_dead+0x30/0x30
[ 171.330707] drain_pages+0x3f/0x60
[ 171.331938] drain_local_pages+0x25/0x30
[ 171.333283] flush_smp_call_function_queue+0x7b/0x170
[ 171.334855] generic_smp_call_function_single_interrupt+0x13/0x30
[ 171.336643] smp_call_function_interrupt+0x27/0x40
[ 171.338456] call_function_interrupt+0x9d/0xb0
[ 171.339911] RIP: 0010:native_safe_halt+0x6/0x10
[ 171.341522] RSP: 0018:ffffc900003a3e70 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff03
[ 171.343860] RAX: ffff88006c5c0040 RBX: 0000000000000000 RCX: 0000000000000000
[ 171.346001] RDX: ffff88006c5c0040 RSI: 0000000000000001 RDI: ffff88006c5c0040
[ 171.348134] RBP: ffffc900003a3e70 R08: 0000000000000000 R09: 0000000000000000
[ 171.350270] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000003
[ 171.352384] R13: ffff88006c5c0040 R14: ffff88006c5c0040 R15: 0000000000000000
[ 171.354503] </IRQ>
[ 171.355616] default_idle+0x23/0x1d0
[ 171.357055] arch_cpu_idle+0xf/0x20
[ 171.361577] default_idle_call+0x23/0x40
[ 171.364329] do_idle+0x162/0x230
[ 171.365799] cpu_startup_entry+0x71/0x80
[ 171.367172] start_secondary+0x17f/0x1f0
[ 171.368530] start_cpu+0x14/0x14
[ 171.369748] Sending NMI from CPU 3 to CPUs 0-2:
[ 171.371367] NMI backtrace for cpu 2
[ 171.371368] CPU: 2 PID: 9766 Comm: oom-write Tainted: G W L 4.10.0-rc7-next-20170207+ #500
[ 171.371368] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 171.371369] task: ffff880067ed8040 task.stack: ffffc9001117c000
[ 171.371369] RIP: 0010:smp_call_function_many+0x25c/0x320
[ 171.371370] RSP: 0000:ffffc9001117fad0 EFLAGS: 00000202
[ 171.371371] RAX: 0000000000000000 RBX: ffff88006d7dd640 RCX: 0000000000000001
[ 171.371372] RDX: 0000000000000001 RSI: ffff88006d3e3398 RDI: ffff88006c528dc8
[ 171.371372] RBP: ffffc9001117fb18 R08: 0000000000000009 R09: 0000000000000000
[ 171.371372] R10: 0000000000000001 R11: ffff88005486d4a8 R12: 0000000000000000
[ 171.371373] R13: ffffffff811fe6e0 R14: 0000000000000000 R15: 0000000000000080
[ 171.371373] FS: 00007fc5a24ec740(0000) GS:ffff88006d600000(0000) knlGS:0000000000000000
[ 171.371374] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 171.371374] CR2: 00007fc4da0c5010 CR3: 0000000044314000 CR4: 00000000001406e0
[ 171.371375] Call Trace:
[ 171.371375] ? page_alloc_cpu_dead+0x30/0x30
[ 171.371375] on_each_cpu_mask+0x30/0xb0
[ 171.371376] drain_all_pages+0x113/0x170
[ 171.371376] __alloc_pages_slowpath+0x520/0xe32
[ 171.371377] __alloc_pages_nodemask+0x382/0x3d0
[ 171.371377] alloc_pages_vma+0xae/0x2f0
[ 171.371377] do_anonymous_page+0x111/0x5d0
[ 171.371378] __handle_mm_fault+0xbc9/0xeb0
[ 171.371378] ? sched_clock+0x9/0x10
[ 171.371379] ? sched_clock_cpu+0x11/0xc0
[ 171.371379] handle_mm_fault+0x16b/0x390
[ 171.371379] ? handle_mm_fault+0x49/0x390
[ 171.371380] __do_page_fault+0x24a/0x530
[ 171.371380] do_page_fault+0x30/0x80
[ 171.371381] page_fault+0x28/0x30
[ 171.371381] RIP: 0033:0x4006a0
[ 171.371381] RSP: 002b:00007ffe3a4acc10 EFLAGS: 00010206
[ 171.371382] RAX: 0000000037fb4000 RBX: 0000000080000000 RCX: 00007fc5a1fda650
[ 171.371383] RDX: 0000000000000000 RSI: 00007ffe3a4aca30 RDI: 00007ffe3a4aca30
[ 171.371383] RBP: 00007fc4a2111010 R08: 00007ffe3a4acb40 R09: 00007ffe3a4ac980
[ 171.371384] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000007
[ 171.371384] R13: 00007fc4a2111010 R14: 0000000000000000 R15: 0000000000000000
[ 171.371385] Code: 7b 3e 2b 00 3b 05 69 b6 d5 00 41 89 c4 0f 8d 3f fe ff ff 48 63 d0 48 8b 33 48 03 34 d5 60 c4 ab 81 8b 56 18 83 e2 01 74 0a f3 90 <8b> 4e 18 83 e1 01 75 f6 83 f8 ff 48 8b 7b 08 74 2a 48 63 35 30
[ 171.372317] NMI backtrace for cpu 0
[ 171.372317] CPU: 0 PID: 9777 Comm: kworker/0:3 Tainted: G W L 4.10.0-rc7-next-20170207+ #500
[ 171.372318] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 171.372318] Workqueue: events vmw_fb_dirty_flush [vmwgfx]
[ 171.372319] task: ffff880064082540 task.stack: ffffc90013b54000
[ 171.372320] RIP: 0010:free_pcppages_bulk+0xbb/0x910
[ 171.372320] RSP: 0000:ffff88006d203eb0 EFLAGS: 00000002
[ 171.372321] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000010
[ 171.372322] RDX: 0000000000000000 RSI: 00000000357cc759 RDI: ffff88006d3e1d50
[ 171.372322] RBP: ffff88006d203f38 R08: ffff88006d3e1d20 R09: 0000000000000002
[ 171.372323] R10: 0000000000000000 R11: 000000000004f7f0 R12: ffff88007ffdd8f8
[ 171.372323] R13: ffffea00013dfc20 R14: ffffea00013dfc00 R15: ffff88007ffdd740
[ 171.372324] FS: 0000000000000000(0000) GS:ffff88006d200000(0000) knlGS:0000000000000000
[ 171.372324] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 171.372325] CR2: 00007fc4da09f010 CR3: 0000000044314000 CR4: 00000000001406f0
[ 171.372325] Call Trace:
[ 171.372325] <IRQ>
[ 171.372326] ? trace_hardirqs_off+0xd/0x10
[ 171.372326] drain_pages_zone+0x82/0x90
[ 171.372327] ? page_alloc_cpu_dead+0x30/0x30
[ 171.372327] drain_pages+0x3f/0x60
[ 171.372327] drain_local_pages+0x25/0x30
[ 171.372328] flush_smp_call_function_queue+0x7b/0x170
[ 171.372328] generic_smp_call_function_single_interrupt+0x13/0x30
[ 171.372329] smp_call_function_interrupt+0x27/0x40
[ 171.372329] call_function_interrupt+0x9d/0xb0
[ 171.372330] RIP: 0010:memcpy_orig+0x19/0x110
[ 171.372330] RSP: 0000:ffffc90013b57d98 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff03
[ 171.372331] RAX: ffffc90003f1c400 RBX: ffff880063b54a88 RCX: 0000000000000004
[ 171.372331] RDX: 00000000000010e0 RSI: ffffc900037d3900 RDI: ffffc90003f1c6e0
[ 171.372332] RBP: ffffc90013b57e08 R08: 0000000000000000 R09: 0000000000000000
[ 171.372332] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000001400
[ 171.372333] R13: 000000000000011e R14: ffff880063b55168 R15: ffffc900037d3620
[ 171.372333] </IRQ>
[ 171.372333] ? vmw_fb_dirty_flush+0x1ef/0x2b0 [vmwgfx]
[ 171.372334] process_one_work+0x22b/0x760
[ 171.372334] ? process_one_work+0x194/0x760
[ 171.372335] worker_thread+0x137/0x4b0
[ 171.372335] kthread+0x10f/0x150
[ 171.372335] ? process_one_work+0x760/0x760
[ 171.372336] ? kthread_create_on_node+0x70/0x70
[ 171.372336] ? do_syscall_64+0x6c/0x200
[ 171.372337] ret_from_fork+0x31/0x40
[ 171.372337] Code: 00 00 4d 89 cf 8b 45 98 8b 75 9c 31 d2 4c 8b 85 78 ff ff ff 83 c0 01 83 c6 01 83 f8 03 0f 44 c2 48 63 c8 48 83 c1 01 48 c1 e1 04 <4c> 01 c1 48 8b 39 48 39 f9 74 de 89 45 98 83 fe 03 89 f0 0f 44
[ 171.372357] NMI backtrace for cpu 1
[ 171.372357] CPU: 1 PID: 47 Comm: khugepaged Tainted: G W L 4.10.0-rc7-next-20170207+ #500
[ 171.372358] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 171.372358] task: ffff88006bd22540 task.stack: ffffc90000890000
[ 171.372359] RIP: 0010:delay_tsc+0x48/0x70
[ 171.372359] RSP: 0018:ffffc90000893520 EFLAGS: 00000006
[ 171.372360] RAX: 000000004e0ee6be RBX: ffff88007ffddd00 RCX: 000000774e0ee6a6
[ 171.372360] RDX: 0000000000000077 RSI: 0000000000000001 RDI: 0000000000000001
[ 171.372361] RBP: ffffc90000893520 R08: 0000000000000000 R09: 0000000000000000
[ 171.372361] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000a6822110
[ 171.372362] R13: 0000000083093e7f R14: 0000000000000001 R15: ffffea000137a020
[ 171.372362] FS: 0000000000000000(0000) GS:ffff88006d400000(0000) knlGS:0000000000000000
[ 171.372363] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 171.372363] CR2: 00007ff5aeb82000 CR3: 0000000068093000 CR4: 00000000001406e0
[ 171.372363] Call Trace:
[ 171.372364] __delay+0xf/0x20
[ 171.372364] do_raw_spin_lock+0x86/0x130
[ 171.372365] _raw_spin_lock_irqsave+0x75/0x90
[ 171.372365] ? free_pcppages_bulk+0x37/0x910
[ 171.372365] free_pcppages_bulk+0x37/0x910
[ 171.372366] ? __kernel_map_pages+0x87/0x120
[ 171.372366] free_hot_cold_page+0x373/0x460
[ 171.372367] free_hot_cold_page_list+0x3c/0x1c0
[ 171.372367] shrink_page_list+0x4dd/0xd10
[ 171.372368] shrink_inactive_list+0x1c5/0x660
[ 171.372368] shrink_node_memcg+0x535/0x7f0
[ 171.372368] ? mem_cgroup_iter+0x14d/0x720
[ 171.372369] shrink_node+0xe1/0x310
[ 171.372369] do_try_to_free_pages+0xe1/0x300
[ 171.372370] try_to_free_pages+0x131/0x3f0
[ 171.372370] __alloc_pages_slowpath+0x479/0xe32
[ 171.372370] __alloc_pages_nodemask+0x382/0x3d0
[ 171.372371] khugepaged_alloc_page+0x6d/0xd0
[ 171.372371] collapse_huge_page+0x81/0x1240
[ 171.372372] ? sched_clock_cpu+0x11/0xc0
[ 171.372372] ? khugepaged_scan_mm_slot+0xc26/0x1000
[ 171.372373] khugepaged_scan_mm_slot+0xc49/0x1000
[ 171.372373] ? sched_clock_cpu+0x11/0xc0
[ 171.372373] ? finish_wait+0x75/0x90
[ 171.372374] khugepaged+0x327/0x5e0
[ 171.372374] ? remove_wait_queue+0x60/0x60
[ 171.372375] kthread+0x10f/0x150
[ 171.372375] ? khugepaged_scan_mm_slot+0x1000/0x1000
[ 171.372375] ? kthread_create_on_node+0x70/0x70
[ 171.372376] ret_from_fork+0x31/0x40
[ 171.372376] Code: 89 d1 48 c1 e1 20 48 09 c1 eb 1b 65 ff 0d c9 0a c1 7e f3 90 65 ff 05 c0 0a c1 7e 65 8b 05 51 d7 c0 7e 39 c6 75 20 0f ae e8 0f 31 <48> c1 e2 20 48 09 c2 48 89 d0 48 29 c8 48 39 f8 72 ce 65 ff 0d
----------

I also got soft lockup without any change using linux-next-20170202.
Something is wrong with calling multiple CPUs? List corruption?

----------
[ 80.556598] ip6_tables: (C) 2000-2006 Netfilter Core Team
[ 84.020374] IPv6: ADDRCONF(NETDEV_UP): eno16777728: link is not ready
[ 84.024423] e1000: eno16777728 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[ 84.030731] IPv6: ADDRCONF(NETDEV_CHANGE): eno16777728: link becomes ready
[ 84.212613] Ebtables v2.0 registered
[ 86.841246] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
[ 90.594709] Netfilter messages via NETLINK v0.30.
[ 90.756119] ip_set: protocol 6
[ 161.309259] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [ip6tables-resto:4210]
[ 162.329982] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd ppdev glue_helper vmw_balloon pcspkr sg parport_pc parport i2c_piix4 shpchp vmw_vsock_vmci_transport vsock vmw_vmci ip_tables xfs libcrc32c sd_mod sr_mod cdrom ata_generic pata_acpi crc32c_intel serio_raw vmwgfx drm_kms_helper syscopyarea sysfillrect
[ 162.486639] sysimgblt fb_sys_fops ttm e1000 mptspi ahci scsi_transport_spi drm libahci ata_piix mptscsih i2c_core mptbase libata
[ 162.486651] irq event stamp: 306010
[ 162.486656] hardirqs last enabled at (306009): [<ffffffff817e4970>] restore_regs_and_iret+0x0/0x1d
[ 162.486658] hardirqs last disabled at (306010): [<ffffffff817e5d98>] apic_timer_interrupt+0x98/0xb0
[ 162.486661] softirqs last enabled at (306008): [<ffffffff817e9739>] __do_softirq+0x349/0x52d
[ 162.486664] softirqs last disabled at (306001): [<ffffffff810a98c5>] irq_exit+0xf5/0x110
[ 162.486666] CPU: 3 PID: 4210 Comm: ip6tables-resto Not tainted 4.10.0-rc6-next-20170202 #498
[ 162.486667] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 162.486668] task: ffff880062eb4a40 task.stack: ffffc900054e8000
[ 162.486671] RIP: 0010:smp_call_function_many+0x25c/0x320
[ 162.486672] RSP: 0018:ffffc900054ebc98 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff10
[ 162.486674] RAX: 0000000000000000 RBX: ffff88006d9dd680 RCX: 0000000000000001
[ 162.486675] RDX: 0000000000000001 RSI: ffff88006d3e3600 RDI: ffff88006c52ad68
[ 162.486675] RBP: ffffc900054ebce0 R08: 0000000000000007 R09: 0000000000000000
[ 162.486676] R10: 0000000000000001 R11: ffff880067ecd768 R12: 0000000000000000
[ 162.486677] R13: ffffffff81080790 R14: ffffc900054ebd18 R15: 0000000000000080
[ 162.486678] FS: 00007f37a4b86740(0000) GS:ffff88006d800000(0000) knlGS:0000000000000000
[ 162.486679] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 162.486680] CR2: 00000000008ed017 CR3: 0000000053f70000 CR4: 00000000001406e0
[ 162.486719] Call Trace:
[ 162.486725] ? x86_configure_nx+0x50/0x50
[ 162.486727] on_each_cpu+0x3b/0xa0
[ 162.486730] flush_tlb_kernel_range+0x79/0x80
[ 162.486734] remove_vm_area+0xb1/0xc0
[ 162.486737] __vunmap+0x2e/0x110
[ 162.486739] vfree+0x2e/0x70
[ 162.486744] do_ip6t_get_ctl+0x2de/0x370 [ip6_tables]
[ 162.486751] nf_getsockopt+0x49/0x70
[ 162.486755] ipv6_getsockopt+0xd3/0x130
[ 162.486758] rawv6_getsockopt+0x2c/0x70
[ 162.486761] sock_common_getsockopt+0x14/0x20
[ 162.486763] SyS_getsockopt+0x77/0xe0
[ 162.486767] do_syscall_64+0x6c/0x200
[ 162.486770] entry_SYSCALL64_slow_path+0x25/0x25
[ 162.486771] RIP: 0033:0x7f37a3d9151a
[ 162.486772] RSP: 002b:00007ffeecccd9e8 EFLAGS: 00000202 ORIG_RAX: 0000000000000037
[ 162.486774] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f37a3d9151a
[ 162.486774] RDX: 0000000000000041 RSI: 0000000000000029 RDI: 0000000000000003
[ 162.486775] RBP: 00000000008e90c0 R08: 00007ffeecccda30 R09: feff7164736b6865
[ 162.486776] R10: 00000000008e90c0 R11: 0000000000000202 R12: 00007ffeecccdf60
[ 162.486777] R13: 00000000008e9010 R14: 00007ffeecccda40 R15: 0000000000000000
[ 162.486783] Code: bb 39 2b 00 3b 05 a9 40 c7 00 41 89 c4 0f 8d 3f fe ff ff 48 63 d0 48 8b 33 48 03 34 d5 60 c4 ab 81 8b 56 18 83 e2 01 74 0a f3 90 <8b> 4e 18 83 e1 01 75 f6 83 f8 ff 48 8b 7b 08 74 2a 48 63 35 70
----------

Mel Gorman

unread,

Feb 7, 2017, 6:34:36 AM2/7/17

to Michal Hocko, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

Right. The unbind operation can set a mask that is any allowable CPU and
the final process_work is not done in a context that prevents
preemption.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3b93879990fd..7af165d308c4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2342,7 +2342,14 @@ void drain_local_pages(struct zone *zone)

static void drain_local_pages_wq(struct work_struct *work)
{
+ /*
+ * Ordinarily a drain operation is bound to a CPU but may be unbound
+ * after a CPU hotplug operation so it's necessary to disable
+ * preemption for the drain to stabilise the CPU ID.
+ */
+ preempt_disable();
drain_local_pages(NULL);
+ preempt_enable_no_resched();
}

/*
@@ -2377,13 +2384,10 @@ void drain_all_pages(struct zone *zone)

mutex_lock(&pcpu_drain_mutex);
}

- get_online_cpus();
-
/*
- * We don't care about racing with CPU hotplug event
- * as offline notification will cause the notified
- * cpu to drain that CPU pcps and on_each_cpu_mask
- * disables preemption as part of its processing
+ * We don't care about racing with CPU hotplug event as offline
+ * notification will cause the notified cpu to drain that CPU pcps
+ * and it is serialised against here via pcpu_drain_mutex.
*/
for_each_online_cpu(cpu) {
struct per_cpu_pageset *pcp;

@@ -2418,7 +2422,6 @@ void drain_all_pages(struct zone *zone)

for_each_cpu(cpu, &cpus_with_pcps)
flush_work(per_cpu_ptr(&pcpu_drain, cpu));

- put_online_cpus();
mutex_unlock(&pcpu_drain_mutex);
}

@@ -6711,7 +6714,16 @@ static int page_alloc_cpu_dead(unsigned int cpu)
{

lru_add_drain_cpu(cpu);
+
+ /*

+ * A per-cpu drain via a workqueue from drain_all_pages can be
+ * rescheduled onto an unrelated CPU. That allows the hotplug
+ * operation and the drain to potentially race on the same
+ * CPU. Serialise hotplug versus drain using pcpu_drain_mutex
+ */
+ mutex_lock(&pcpu_drain_mutex);

drain_pages(cpu);
+ mutex_unlock(&pcpu_drain_mutex);

/*
* Spill the event counters of the dead processor

Michal Hocko

unread,

Feb 7, 2017, 6:43:31 AM2/7/17

to Mel Gorman, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

[...]

> @@ -6711,7 +6714,16 @@ static int page_alloc_cpu_dead(unsigned int cpu)
> {
>
> lru_add_drain_cpu(cpu);
> +
> + /*
> + * A per-cpu drain via a workqueue from drain_all_pages can be
> + * rescheduled onto an unrelated CPU. That allows the hotplug
> + * operation and the drain to potentially race on the same
> + * CPU. Serialise hotplug versus drain using pcpu_drain_mutex
> + */
> + mutex_lock(&pcpu_drain_mutex);
> drain_pages(cpu);
> + mutex_unlock(&pcpu_drain_mutex);

You cannot put sleepable lock inside the preempt disbaled section...
We can make it a spinlock right?

Vlastimil Babka

unread,

Feb 7, 2017, 6:54:51 AM2/7/17

to Michal Hocko, Mel Gorman, Dmitry Vyukov, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

Could we do flush_work() with a spinlock? Sounds bad too.
Maybe we could just use the fact that the whole drain happens with disabled
irq's and obtain the current cpu under that protection?

Michal Hocko

unread,

Feb 7, 2017, 7:08:12 AM2/7/17

to Vlastimil Babka, Mel Gorman, Dmitry Vyukov, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

We surely cannot. I thought the lock would be gone in drain_all_pages,
we would deadlock with the lock there anyway... But it is true that we
would need a way to only allow one caller to get in. This is getting
messier and messier...

> Maybe we could just use the fact that the whole drain happens with disabled
> irq's and obtain the current cpu under that protection?

preempt_disable should be enough, no? The CPU callback is not called
from an IRQ context, right?
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1ee49474207e..4a9a65479435 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2343,7 +2343,16 @@ void drain_local_pages(struct zone *zone)

static void drain_local_pages_wq(struct work_struct *work)
{
+ /*

+ * drain_all_pages doesn't use proper cpu hotplug protection so
+ * we can race with cpu offline when the WQ can move this from
+ * a cpu pinned worker to an unbound one. We can operate on a different
+ * cpu which is allright but we also have to make sure to not move to
+ * a different one.

+ */
+ preempt_disable();
drain_local_pages(NULL);

+ preempt_enable();
}

/*
@@ -2379,12 +2388,6 @@ void drain_all_pages(struct zone *zone)
}

/*
- * As this can be called from reclaim context, do not reenter reclaim.
- * An allocation failure can be handled, it's simply slower
- */
- get_online_cpus();
-
- /*

* We don't care about racing with CPU hotplug event

* as offline notification will cause the notified

* cpu to drain that CPU pcps and on_each_cpu_mask

@@ -2423,7 +2426,6 @@ void drain_all_pages(struct zone *zone)

for_each_cpu(cpu, &cpus_with_pcps)
flush_work(per_cpu_ptr(&pcpu_drain, cpu));

- put_online_cpus();
mutex_unlock(&pcpu_drain_mutex);
}

Michal Hocko

unread,

Feb 7, 2017, 7:37:11 AM2/7/17

to Mel Gorman, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

Scratch that! For some reason I thought that cpu notifiers are run in an
atomic context. Now that I am checking the code again it turns out I was
wrong. __cpu_notify uses __raw_notifier_call_chain so this is not an
atomic context. Anyway, shouldn't be it sufficient to disable preemption
on drain_local_pages_wq? The CPU hotplug callback will not preempt us
and so we cannot work on the same cpus, right?

Vlastimil Babka

unread,

Feb 7, 2017, 7:43:45 AM2/7/17

to Michal Hocko, Mel Gorman, Dmitry Vyukov, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

On 02/07/2017 01:37 PM, Michal Hocko wrote:
>> > @@ -6711,7 +6714,16 @@ static int page_alloc_cpu_dead(unsigned int cpu)
>> > {
>> >
>> > lru_add_drain_cpu(cpu);
>> > +
>> > + /*
>> > + * A per-cpu drain via a workqueue from drain_all_pages can be
>> > + * rescheduled onto an unrelated CPU. That allows the hotplug
>> > + * operation and the drain to potentially race on the same
>> > + * CPU. Serialise hotplug versus drain using pcpu_drain_mutex
>> > + */
>> > + mutex_lock(&pcpu_drain_mutex);
>> > drain_pages(cpu);
>> > + mutex_unlock(&pcpu_drain_mutex);
>>
>> You cannot put sleepable lock inside the preempt disbaled section...
>> We can make it a spinlock right?
>
> Scratch that! For some reason I thought that cpu notifiers are run in an
> atomic context. Now that I am checking the code again it turns out I was
> wrong. __cpu_notify uses __raw_notifier_call_chain so this is not an
> atomic context.

Good.

> Anyway, shouldn't be it sufficient to disable preemption
> on drain_local_pages_wq? The CPU hotplug callback will not preempt us
> and so we cannot work on the same cpus, right?

I thought the problem here was that the callback races with the work item that
has been migrated to a different cpu. Once we are not working on the local cpu,
disabling preempt/irq's won't help?

Michal Hocko

unread,

Feb 7, 2017, 7:48:38 AM2/7/17

to Vlastimil Babka, Mel Gorman, Dmitry Vyukov, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

On Tue 07-02-17 13:43:39, Vlastimil Babka wrote:
[...]

> > Anyway, shouldn't be it sufficient to disable preemption
> > on drain_local_pages_wq? The CPU hotplug callback will not preempt us
> > and so we cannot work on the same cpus, right?
>
> I thought the problem here was that the callback races with the work item
> that has been migrated to a different cpu. Once we are not working on the
> local cpu, disabling preempt/irq's won't help?

If the worker is racing with the callback than only one of can run on a
_particular_ cpu. So they cannot race. Or am I missing something?

Mel Gorman

unread,

Feb 7, 2017, 8:03:52 AM2/7/17

to Michal Hocko, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

The CPU down callback can hold a mutex and at least he SLUB callback
already does so. That gives

page_alloc_cpu_dead
mutex_lock
drain_pages
mutex_unlock

drain_all_pages
mutex_lock
queue workqueue
drain_local_pages_wq
preempt_disable
drain_local_pages
drain_pages
preempt_enable
flush queues
mutex_unlock

I must be blind or maybe it's rushing between multiple concerns but which
sleepable lock is of concern?

Michal Hocko

unread,

Feb 7, 2017, 8:48:20 AM2/7/17

to Mel Gorman, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

I thought the cpu hotplug callback was non-preemptible. This is not the
case as mentioned in other reply. The pcpu_drain_mutex in the hotplug
callback is alright. Sorry about the confusion! I am still wondering
whether the lock is really needed. See the other reply.

Vlastimil Babka

unread,

Feb 7, 2017, 8:58:00 AM2/7/17

to Michal Hocko, Mel Gorman, Dmitry Vyukov, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

Ah I forgot that migrated work item will in fact run on local cpu. So looks like
nobody should race with the callback indeed (assuming that when the callback is
called, the cpu in question already isn't executing workqueue workers).

Mel Gorman

unread,

Feb 7, 2017, 8:58:48 AM2/7/17

to Michal Hocko, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

On Tue, Feb 07, 2017 at 01:37:08PM +0100, Michal Hocko wrote:
> > You cannot put sleepable lock inside the preempt disbaled section...
> > We can make it a spinlock right?
>
> Scratch that! For some reason I thought that cpu notifiers are run in an
> atomic context. Now that I am checking the code again it turns out I was
> wrong. __cpu_notify uses __raw_notifier_call_chain so this is not an
> atomic context.

Indeed.

> Anyway, shouldn't be it sufficient to disable preemption
> on drain_local_pages_wq?

That would be sufficient for a hot-removed CPU moving the drain request
to another CPU and avoiding any scheduling events.

> The CPU hotplug callback will not preempt us
> and so we cannot work on the same cpus, right?
>

I don't see a specific guarantee that it cannot be preempted and it
would depend on an the exact cpu hotplug implementation which is subject
to quite a lot of change. Hence, the mutex provides a guantee that the
hot-removed CPU teardown cannot run on the same CPU as a workqueue drain
running on a CPU it was not originally scheduled for.

Michal Hocko

unread,

Feb 7, 2017, 9:19:13 AM2/7/17

to Mel Gorman, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

On Tue 07-02-17 13:58:46, Mel Gorman wrote:
> On Tue, Feb 07, 2017 at 01:37:08PM +0100, Michal Hocko wrote:

[...]

> > Anyway, shouldn't be it sufficient to disable preemption
> > on drain_local_pages_wq?
>
> That would be sufficient for a hot-removed CPU moving the drain request
> to another CPU and avoiding any scheduling events.
>
> > The CPU hotplug callback will not preempt us
> > and so we cannot work on the same cpus, right?
> >
>
> I don't see a specific guarantee that it cannot be preempted and it
> would depend on an the exact cpu hotplug implementation which is subject
> to quite a lot of change.

But we do not care about the whole cpu hotplug code. The only part we
really do care about is the race inside drain_pages_zone and that will
run in an atomic context on the specific CPU.

You are absolutely right that using the mutex is safe as well but the
hotplug path is already littered with locks and adding one more to the
picture doesn't sound great to me. So I would really like to not use a
lock if that is possible and safe (with a big fat comment of course).

Michal Hocko

unread,

Feb 7, 2017, 10:35:04 AM2/7/17

to Mel Gorman, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

And with the full changelog. I hope I haven't missed anything this time.
---
From 8c6af3116520251cc4ec2213f0a4ed2544bb4365 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mho...@suse.com>
Date: Tue, 7 Feb 2017 16:08:35 +0100
Subject: [PATCH] mm, page_alloc: do not depend on cpu hotplug locks inside the
allocator

Dmitry has reported the following lockdep splat
[<ffffffff81571db1>] lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753

[<ffffffff8436697e>] __mutex_lock_common kernel/locking/mutex.c:521 [inline]

[<ffffffff8436697e>] mutex_lock_nested+0x24e/0xff0 kernel/locking/mutex.c:621
[<ffffffff818f07ea>] pcpu_alloc+0xbda/0x1280 mm/percpu.c:896
[<ffffffff818f0ee4>] __alloc_percpu+0x24/0x30 mm/percpu.c:1075
[<ffffffff816543e3>] smpcfd_prepare_cpu+0x73/0xd0 kernel/smp.c:44
[<ffffffff814240b4>] cpuhp_invoke_callback+0x254/0x1480 kernel/cpu.c:136
[<ffffffff81425821>] cpuhp_up_callbacks+0x81/0x2a0 kernel/cpu.c:493
[<ffffffff81427bf3>] _cpu_up+0x1e3/0x2a0 kernel/cpu.c:1057
[<ffffffff81427d23>] do_cpu_up+0x73/0xa0 kernel/cpu.c:1087
[<ffffffff81427d68>] cpu_up+0x18/0x20 kernel/cpu.c:1095
[<ffffffff854ede84>] smp_init+0xe9/0xee kernel/smp.c:564
[<ffffffff85482f81>] kernel_init_freeable+0x439/0x690 init/main.c:1010
[<ffffffff84357083>] kernel_init+0x13/0x180 init/main.c:941

[<ffffffff84377baa>] ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433

cpu_hotplug_begin
cpu_hotplug.lock
pcpu_alloc
pcpu_alloc_mutex

[<ffffffff81423012>] get_online_cpus+0x62/0x90 kernel/cpu.c:248
[<ffffffff8185fcf8>] drain_all_pages+0xf8/0x710 mm/page_alloc.c:2385

[<ffffffff81865e5d>] __alloc_pages_direct_reclaim mm/page_alloc.c:3440 [inline]

[<ffffffff81865e5d>] __alloc_pages_slowpath+0x8fd/0x2370 mm/page_alloc.c:3778
[<ffffffff818681c5>] __alloc_pages_nodemask+0x8f5/0xc60 mm/page_alloc.c:3980

[<ffffffff818ed0c1>] __alloc_pages include/linux/gfp.h:426 [inline]

[<ffffffff818ed0c1>] __alloc_pages_node include/linux/gfp.h:439 [inline]

[<ffffffff818ed0c1>] alloc_pages_node include/linux/gfp.h:453 [inline]

[<ffffffff818ed0c1>] pcpu_alloc_pages mm/percpu-vm.c:93 [inline]

[<ffffffff818ed0c1>] pcpu_populate_chunk+0x1e1/0x900 mm/percpu-vm.c:282
[<ffffffff818f0a11>] pcpu_alloc+0xe01/0x1280 mm/percpu.c:998
[<ffffffff818f0eb7>] __alloc_percpu_gfp+0x27/0x30 mm/percpu.c:1062

[<ffffffff817d25b2>] bpf_array_alloc_percpu kernel/bpf/arraymap.c:34 [inline]

[<ffffffff817d25b2>] array_map_alloc+0x532/0x710 kernel/bpf/arraymap.c:99

[<ffffffff817ba034>] find_and_alloc_map kernel/bpf/syscall.c:34 [inline]

[<ffffffff817ba034>] map_create kernel/bpf/syscall.c:188 [inline]

[<ffffffff817ba034>] SYSC_bpf kernel/bpf/syscall.c:870 [inline]

[<ffffffff817ba034>] SyS_bpf+0xd64/0x2500 kernel/bpf/syscall.c:827

[<ffffffff84377941>] entry_SYSCALL_64_fastpath+0x1f/0xc2

pcpu_alloc
pcpu_alloc_mutex
drain_all_pages

get_online_cpus
cpu_hotplug.lock

[<ffffffff81427876>] cpu_hotplug_begin+0x206/0x2e0 kernel/cpu.c:304
[<ffffffff81427ada>] _cpu_up+0xca/0x2a0 kernel/cpu.c:1011
[<ffffffff81427d23>] do_cpu_up+0x73/0xa0 kernel/cpu.c:1087
[<ffffffff81427d68>] cpu_up+0x18/0x20 kernel/cpu.c:1095
[<ffffffff854ede84>] smp_init+0xe9/0xee kernel/smp.c:564
[<ffffffff85482f81>] kernel_init_freeable+0x439/0x690 init/main.c:1010
[<ffffffff84357083>] kernel_init+0x13/0x180 init/main.c:941

[<ffffffff84377baa>] ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433

cpu_hotplug_begin
cpu_hotplug.lock

Pulling cpu hotplug locks inside the page allocator is just too
dangerous. Let's remove the dependency by dropping get_online_cpus()
from drain_all_pages. This is not so simple though because now we do not
have a protection against cpu hotplug which means 2 things:
- the work item might be executed on a different cpu in worker
from unbound pool so it doesn't run on pinned on the cpu
- we have to make sure that we do not race with page_alloc_cpu_dead
calling drain_pages_zone

Disabling preemption in drain_local_pages_wq will solve the first
problem drain_local_pages will determine its local CPU from the WQ
context which will be stable after that point, page_alloc_cpu_dead
is pinned to the CPU already. The later condition is achieved
by disabling IRQs in drain_pages_zone.

Reported-by: Dmitry Vyukov <dvy...@google.com>
Signed-off-by: Michal Hocko <mho...@suse.com>
---
mm/page_alloc.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c3358d4f7932..b6411816787a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2343,7 +2343,16 @@ void drain_local_pages(struct zone *zone)

static void drain_local_pages_wq(struct work_struct *work)
{
+ /*

+ * drain_all_pages doesn't use proper cpu hotplug protection so
+ * we can race with cpu offline when the WQ can move this from
+ * a cpu pinned worker to an unbound one. We can operate on a different
+ * cpu which is allright but we also have to make sure to not move to
+ * a different one.

+ */
+ preempt_disable();
drain_local_pages(NULL);

+ preempt_enable();
}

/*
@@ -2379,12 +2388,6 @@ void drain_all_pages(struct zone *zone)
}

/*
- * As this can be called from reclaim context, do not reenter reclaim.
- * An allocation failure can be handled, it's simply slower
- */
- get_online_cpus();
-
- /*

* We don't care about racing with CPU hotplug event

* as offline notification will cause the notified

* cpu to drain that CPU pcps and on_each_cpu_mask

@@ -2423,7 +2426,6 @@ void drain_all_pages(struct zone *zone)

for_each_cpu(cpu, &cpus_with_pcps)
flush_work(per_cpu_ptr(&pcpu_drain, cpu));

- put_online_cpus();
mutex_unlock(&pcpu_drain_mutex);
}

--
2.11.0

Mel Gorman

unread,

Feb 7, 2017, 11:22:27 AM2/7/17

to Michal Hocko, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

On Tue, Feb 07, 2017 at 04:34:59PM +0100, Michal Hocko wrote:
> > But we do not care about the whole cpu hotplug code. The only part we
> > really do care about is the race inside drain_pages_zone and that will
> > run in an atomic context on the specific CPU.
> >
> > You are absolutely right that using the mutex is safe as well but the
> > hotplug path is already littered with locks and adding one more to the
> > picture doesn't sound great to me. So I would really like to not use a
> > lock if that is possible and safe (with a big fat comment of course).
>
> And with the full changelog. I hope I haven't missed anything this time.
> ---
> From 8c6af3116520251cc4ec2213f0a4ed2544bb4365 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mho...@suse.com>
> Date: Tue, 7 Feb 2017 16:08:35 +0100
> Subject: [PATCH] mm, page_alloc: do not depend on cpu hotplug locks inside the
> allocator
>

> <SNIP>

>
> Reported-by: Dmitry Vyukov <dvy...@google.com>
> Signed-off-by: Michal Hocko <mho...@suse.com>

Not that I can think of. It's almost identical to the diff I posted with
the exception of the mutex in the cpu hotplug teardown path. I agree that
in the current implementation that it should be unnecessary even if I
thought it would be more robust against any other hotplug churn.

Acked-by: Mel Gorman <mgo...@techsingularity.net>

Michal Hocko

unread,

Feb 7, 2017, 11:41:34 AM2/7/17

to Mel Gorman, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

I am always nervous when seeing hotplug locks being used in low level
code. It has bitten us several times already and those deadlocks are
quite hard to spot when reviewing the code and very rare to hit so they
tend to live for a long time.

> Acked-by: Mel Gorman <mgo...@techsingularity.net>

Thanks! I will wait for Tejun to confirm my assumptions are correct and
post the patch to Andrew if there are no further problems spotted. Btw.
this will also get rid of another lockdep report which seem to be false
possitive though
http://lkml.kernel.org/r/20170203145...@dhcp22.suse.cz

Christoph Lameter

unread,

Feb 7, 2017, 11:55:56 AM2/7/17

to Michal Hocko, Mel Gorman, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

On Tue, 7 Feb 2017, Michal Hocko wrote:

> I am always nervous when seeing hotplug locks being used in low level
> code. It has bitten us several times already and those deadlocks are
> quite hard to spot when reviewing the code and very rare to hit so they
> tend to live for a long time.

Yep. Hotplug events are pretty significant. Using stop_machine_XXXX() etc
would be advisable and that would avoid the taking of locks and get rid of all the
ocmplexity, reduce the code size and make the overall system much more
reliable.

Thomas?

Tejun Heo

unread,

Feb 7, 2017, 12:03:22 PM2/7/17

to Michal Hocko, Mel Gorman, Vlastimil Babka, Dmitry Vyukov, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

Hello,

Sorry about the delay.

On Tue, Feb 07, 2017 at 04:34:59PM +0100, Michal Hocko wrote:

I think this would work; however, a more canonical way would be
something along the line of...

drain_all_pages()
{
...
spin_lock();
for_each_possible_cpu() {
if (this cpu should get drained) {
queue_work_on(this cpu's work);
}
}
spin_unlock();
...
}

offline_hook()
{
spin_lock();
this cpu should get drained = false;
spin_unlock();
queue_work_on(this cpu's work);
flush_work(this cpu's work);
}

I think what workqueue should do is automatically flush in-flight CPU
work items on CPU offline and erroring out on queue_work_on() on
offline CPUs. And we now actually can do that because we have lifted
the guarantee that queue_work() is local CPU affine some releases ago.
I'll look into it soonish.

For the time being, either approach should be fine. The more
canonical one might be a bit less surprising but the
preempt_disable/enable() change is short and sweet and completely fine
for the case at hand.

Please feel free to add

Acked-by: Tejun Heo <t...@kernel.org>

Thanks.

--
tejun

Michal Hocko

unread,

Feb 7, 2017, 3:16:34 PM2/7/17

to Tejun Heo, Mel Gorman, Vlastimil Babka, Dmitry Vyukov, Christoph Lameter, linu...@kvack.org, LKML, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

I see

> I think what workqueue should do is automatically flush in-flight CPU
> work items on CPU offline and erroring out on queue_work_on() on
> offline CPUs. And we now actually can do that because we have lifted
> the guarantee that queue_work() is local CPU affine some releases ago.
> I'll look into it soonish.
>
> For the time being, either approach should be fine. The more
> canonical one might be a bit less surprising but the
> preempt_disable/enable() change is short and sweet and completely fine
> for the case at hand.

Thanks for double checking!

> Please feel free to add
>
> Acked-by: Tejun Heo <t...@kernel.org>

Thanks!

Thomas Gleixner

unread,

Feb 7, 2017, 4:53:41 PM2/7/17

to Dmitry Vyukov, Vlastimil Babka, Tejun Heo, Christoph Lameter, linu...@kvack.org, LKML, Ingo Molnar, Peter Zijlstra, syzkaller, Mel Gorman, Michal Hocko, Andrew Morton

On Mon, 6 Feb 2017, Dmitry Vyukov wrote:
> On Mon, Jan 30, 2017 at 4:48 PM, Dmitry Vyukov <dvy...@google.com> wrote:
> Unfortunately it does not seem to help.

> Fuzzer now runs on 510948533b059f4f5033464f9f4a0c32d4ab0c08 of
> mmotm/auto-latest
> (git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git):
>
> commit 510948533b059f4f5033464f9f4a0c32d4ab0c08
> Date: Thu Feb 2 10:08:47 2017 +0100
> mmotm: userfaultfd-non-cooperative-add-event-for-memory-unmaps-fix
>
> The commit you referenced is already there:
>
> commit 806b158031ca0b4714e775898396529a758ebc2c
> Date: Thu Feb 2 08:53:16 2017 +0100

> mm, page_alloc: use static global work_struct for draining per-cpu pages

<SNIP>

> Chain exists of:
> Possible unsafe locking scenario:
>
> CPU0 CPU1
> ---- ----
> lock(pcpu_alloc_mutex);
> lock(cpu_hotplug.lock);
> lock(pcpu_alloc_mutex);
> lock(cpu_hotplug.dep_map);

And that's exactly what happens:

cpu_up()
alloc_percpu() lock(hotplug.lock)
lock(&pcpu_alloc_mutex)
.. alloc_percpu()
drain_all_pages() lock(&pcpu_alloc_mutex)
get_online_cpus()
lock(hotplug.lock)

Classic deadlock, i.e. you _cannot_ call get_online_cpus() while holding
pcpu_alloc_mutex.

Alternatively you can forbid to do per cpu alloc/free while holding
hotplug.lock. I doubt that this will make people happy :)

Thanks,

tglx

Thomas Gleixner

unread,

Feb 7, 2017, 5:25:28 PM2/7/17

to Christoph Lameter, Michal Hocko, Mel Gorman, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, linu...@kvack.org, LKML, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

Huch? stop_machine() is horrible and heavy weight. Don't go there, there
must be simpler solutions than that.

Thanks,

tglx

Michal Hocko

unread,

Feb 8, 2017, 2:35:30 AM2/8/17

to Thomas Gleixner, Christoph Lameter, Mel Gorman, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, linu...@kvack.org, LKML, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

Absolutely agreed. We are in the page allocator path so using the
stop_machine* is just ridiculous. And, in fact, there is a much simpler
solution [1]

[1] http://lkml.kernel.org/r/20170207201950...@kernel.org

Thomas Gleixner

unread,

Feb 8, 2017, 7:02:15 AM2/8/17

to Michal Hocko, Christoph Lameter, Mel Gorman, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, linu...@kvack.org, LKML, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

Well, yes. It's simple, but from an RT point of view I really don't like
it as we have to fix it up again.

On RT we solved the problem of the page allocator differently which allows
us to do drain_all_pages() from the caller CPU as a side effect. That's
interesting not only for RT, it's also interesting for NOHZ FULL scenarios
because you don't inflict the work on the other CPUs.

https://git.kernel.org/cgit/linux/kernel/git/rt/linux-rt-devel.git/commit/?h=linux-4.9.y-rt-rebase&id=d577a017da694e29a06af057c517f2a7051eb305

That uses local locks (an RT speciality which compile away into preempt/irq
disable/enable when RT is disabled).

Works like a charm :)

Thanks,

tglx

Michal Hocko

unread,

Feb 8, 2017, 7:21:58 AM2/8/17

to Thomas Gleixner, Christoph Lameter, Mel Gorman, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, linu...@kvack.org, LKML, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

On Wed 08-02-17 13:02:07, Thomas Gleixner wrote:
> On Wed, 8 Feb 2017, Michal Hocko wrote:

[...]

> > [1] http://lkml.kernel.org/r/20170207201950...@kernel.org
>
> Well, yes. It's simple, but from an RT point of view I really don't like
> it as we have to fix it up again.

I thought that preempt_disable would turn into migrate_disable or
something like that which shouldn't cause too much trouble. Or am I
missing something? Which part of the patch is so RT unfriendly?

Mel Gorman

unread,

Feb 8, 2017, 7:26:13 AM2/8/17

to Thomas Gleixner, Michal Hocko, Christoph Lameter, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, linu...@kvack.org, LKML, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

It may be worth noting that patches in Andrew's tree no longer disable
interrupts in the per-cpu allocator and now per-cpu draining will
be from workqueue context. The reasoning was due to the overhead of
the page allocator with figures included. Interrupts will bypass the
per-cpu allocator and use the irq-safe zone->lock to allocate from
the core. It'll collide with the RT patch. Primary patch of interest is
http://www.ozlabs.org/~akpm/mmots/broken-out/mm-page_alloc-only-use-per-cpu-allocator-for-irq-safe-requests.patch

The draining from workqueue context may be a problem for RT but one
option would be to move the drain to only drain for high-order pages
after direct reclaim combined with only draining for order-0 if
__alloc_pages_may_oom is about to be called.

Thomas Gleixner

unread,

Feb 8, 2017, 8:23:51 AM2/8/17

to Mel Gorman, Michal Hocko, Christoph Lameter, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, linu...@kvack.org, LKML, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

On Wed, 8 Feb 2017, Mel Gorman wrote:
> It may be worth noting that patches in Andrew's tree no longer disable
> interrupts in the per-cpu allocator and now per-cpu draining will
> be from workqueue context. The reasoning was due to the overhead of
> the page allocator with figures included. Interrupts will bypass the
> per-cpu allocator and use the irq-safe zone->lock to allocate from
> the core. It'll collide with the RT patch. Primary patch of interest is
> http://www.ozlabs.org/~akpm/mmots/broken-out/mm-page_alloc-only-use-per-cpu-allocator-for-irq-safe-requests.patch

Yeah, we'll sort that out once it hits Linus tree and we move RT forward.
Though I have once complaint right away:

+ preempt_enable_no_resched();

This is a nono, even in mainline. You effectively disable a preemption
point.

> The draining from workqueue context may be a problem for RT but one
> option would be to move the drain to only drain for high-order pages
> after direct reclaim combined with only draining for order-0 if
> __alloc_pages_may_oom is about to be called.

Why would the draining from workqueue context be an issue on RT?

Thanks,

tglx

Mel Gorman

unread,

Feb 8, 2017, 9:03:33 AM2/8/17

to Thomas Gleixner, Michal Hocko, Christoph Lameter, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, linu...@kvack.org, LKML, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

On Wed, Feb 08, 2017 at 02:23:19PM +0100, Thomas Gleixner wrote:
> On Wed, 8 Feb 2017, Mel Gorman wrote:
> > It may be worth noting that patches in Andrew's tree no longer disable
> > interrupts in the per-cpu allocator and now per-cpu draining will
> > be from workqueue context. The reasoning was due to the overhead of
> > the page allocator with figures included. Interrupts will bypass the
> > per-cpu allocator and use the irq-safe zone->lock to allocate from
> > the core. It'll collide with the RT patch. Primary patch of interest is
> > http://www.ozlabs.org/~akpm/mmots/broken-out/mm-page_alloc-only-use-per-cpu-allocator-for-irq-safe-requests.patch
>
> Yeah, we'll sort that out once it hits Linus tree and we move RT forward.
> Though I have once complaint right away:
>
> + preempt_enable_no_resched();
>
> This is a nono, even in mainline. You effectively disable a preemption
> point.
>

This came up during review on whether it should or shouldn't be a preemption
point. Initially it was preempt_enable() but a preemption point didn't
exist before, the reviewer pushed for it and as it was the allocator fast
path that was unlikely to need a reschedule or preempt, I made the change.

I can alter it before it hits mainline if you say RT is going to have an
issue with it.

> > The draining from workqueue context may be a problem for RT but one
> > option would be to move the drain to only drain for high-order pages
> > after direct reclaim combined with only draining for order-0 if
> > __alloc_pages_may_oom is about to be called.
>
> Why would the draining from workqueue context be an issue on RT?
>

It probably isn't. The latency of the operation is likely longer than an IPI
was but given the context it occurs in, I severely doubted it mattered. I
couldn't think of a reason why it would matter to RT but there was no harm
double checking.

Peter Zijlstra

unread,

Feb 8, 2017, 9:11:28 AM2/8/17

to Mel Gorman, Thomas Gleixner, Michal Hocko, Christoph Lameter, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, linu...@kvack.org, LKML, Ingo Molnar, syzkaller, Andrew Morton

On Wed, Feb 08, 2017 at 02:03:32PM +0000, Mel Gorman wrote:
> > Yeah, we'll sort that out once it hits Linus tree and we move RT forward.
> > Though I have once complaint right away:
> >
> > + preempt_enable_no_resched();
> >
> > This is a nono, even in mainline. You effectively disable a preemption
> > point.
> >
>
> This came up during review on whether it should or shouldn't be a preemption
> point. Initially it was preempt_enable() but a preemption point didn't
> exist before, the reviewer pushed for it and as it was the allocator fast
> path that was unlikely to need a reschedule or preempt, I made the change.

Not relevant. The only acceptable use of preempt_enable_no_resched() is
if the next statement is a schedule() variant.

Christoph Lameter

unread,

Feb 8, 2017, 10:06:51 AM2/8/17

to Thomas Gleixner, Michal Hocko, Mel Gorman, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, linu...@kvack.org, LKML, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

On Tue, 7 Feb 2017, Thomas Gleixner wrote:

> > Yep. Hotplug events are pretty significant. Using stop_machine_XXXX() etc
> > would be advisable and that would avoid the taking of locks and get rid of all the
> > ocmplexity, reduce the code size and make the overall system much more
> > reliable.
>
> Huch? stop_machine() is horrible and heavy weight. Don't go there, there
> must be simpler solutions than that.

Inserting or removing hardware is a heavy process. This would help quite a
bit with these issues for loops over active cpus.

Christoph Lameter

unread,

Feb 8, 2017, 10:11:09 AM2/8/17

to Michal Hocko, Thomas Gleixner, Mel Gorman, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, linu...@kvack.org, LKML, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

On Wed, 8 Feb 2017, Michal Hocko wrote:

> > Huch? stop_machine() is horrible and heavy weight. Don't go there, there
> > must be simpler solutions than that.
>
> Absolutely agreed. We are in the page allocator path so using the
> stop_machine* is just ridiculous. And, in fact, there is a much simpler
> solution [1]

That is nonsense. stop_machine would be used when adding removing a
processor. There would be no need to synchronize when looping over active
cpus anymore. get_online_cpus() etc would be removed from the hot
path since the cpu masks are guaranteed to be stable.

Michal Hocko

unread,

Feb 8, 2017, 10:21:10 AM2/8/17

to Christoph Lameter, Thomas Gleixner, Mel Gorman, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, linu...@kvack.org, LKML, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

I have no idea what you are trying to say and how this is related to the
deadlock we are discussing here. We certainly do not need to add
stop_machine the problem. And yeah, dropping get_online_cpus was
possible after considering all fallouts.

Christoph Lameter

unread,

Feb 8, 2017, 11:17:23 AM2/8/17

to Michal Hocko, Thomas Gleixner, Mel Gorman, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, linu...@kvack.org, LKML, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

This is not the first time get_online_cpus() causes problems due to the
need to support hotplug for processors. Hotplugging is not happening
frequently (which is low balling it. Actually the frequency of the hotplug
events on almost all systems is zero) so the constant check is a useless
overhead and causes trouble for development. In particular
get_online_cpus() is often needed in sections that need to hold locks.

So lets get rid of it. The severity, frequency and rarity of processor
hotplug events would justify only allowing adding and removal of
processors through the stop_machine_xx mechanism. With that in place the
processor masks can be used without synchronization and the locking issues
all over the kernel would become simpler.

It is likely that this will even improve the hotplug code because the
easier form of synchronization (you have a piece of code that executed
while the OS is in stop state) would allow to make more significant
changes to the software environment. F.e. one could think about removing
memory segments as well as maybe per cpu segments.

Thomas Gleixner

unread,

Feb 8, 2017, 12:46:39 PM2/8/17

to Christoph Lameter, Michal Hocko, Mel Gorman, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, linu...@kvack.org, LKML, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

On Wed, 8 Feb 2017, Christoph Lameter wrote:
> On Wed, 8 Feb 2017, Michal Hocko wrote:
>
> > I have no idea what you are trying to say and how this is related to the
> > deadlock we are discussing here. We certainly do not need to add
> > stop_machine the problem. And yeah, dropping get_online_cpus was
> > possible after considering all fallouts.
>
> This is not the first time get_online_cpus() causes problems due to the
> need to support hotplug for processors. Hotplugging is not happening
> frequently (which is low balling it. Actually the frequency of the hotplug
> events on almost all systems is zero) so the constant check is a useless
> overhead and causes trouble for development. In particular

There is a world outside yours. Hotplug is actually used frequently for
power purposes in some scenarios.

> get_online_cpus() is often needed in sections that need to hold locks.
>
> So lets get rid of it. The severity, frequency and rarity of processor
> hotplug events would justify only allowing adding and removal of
> processors through the stop_machine_xx mechanism. With that in place the
> processor masks can be used without synchronization and the locking issues
> all over the kernel would become simpler.
>
> It is likely that this will even improve the hotplug code because the
> easier form of synchronization (you have a piece of code that executed
> while the OS is in stop state) would allow to make more significant
> changes to the software environment. F.e. one could think about removing
> memory segments as well as maybe per cpu segments.

It will improve nothing. The stop machine context is extremly limited and
you cannot do complex things there at all. Not to talk about the inability
of taking a simple mutex which would immediately deadlock the machine.

stop machine is the last resort for things which need to be done atomically
and that operation can be done in a very restricted context.

And everything complex needs to be done _before_ that in normal
context. Hot unplug already uses stop machine for the final removal of the
outgoing CPU, but that's definitely not the place where you can do anything
complex like page management.

If you can prepare the outgoing cpu work during the cpu offline phase and
then just flip a bit in the stop machine part, then this might work, but
anything else is just handwaving and proliferation of wet dreams.

Thanks,

tglx

Christoph Lameter

unread,

Feb 8, 2017, 10:15:08 PM2/8/17

to Thomas Gleixner, Michal Hocko, Mel Gorman, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, linu...@kvack.org, LKML, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

On Wed, 8 Feb 2017, Thomas Gleixner wrote:

> There is a world outside yours. Hotplug is actually used frequently for
> power purposes in some scenarios.

The usual case does not inolve hotplug.

> It will improve nothing. The stop machine context is extremly limited and
> you cannot do complex things there at all. Not to talk about the inability
> of taking a simple mutex which would immediately deadlock the machine.

You do not need to do complex things. Basically flipping some cpu mask
bits will do it. stop machine ensures that code is not
executing on the processors when the bits are flipped. That will ensure
that there is no need to do any get_online_cpu() nastiness in critical VM
paths since we are guaranteed not to be executing them.

> And everything complex needs to be done _before_ that in normal
> context. Hot unplug already uses stop machine for the final removal of the
> outgoing CPU, but that's definitely not the place where you can do anything
> complex like page management.

If it already does that then why do we still need get_online_cpu()? We do
not do anything like page management. Why would we? We just need to ensure
that nothing is executing when the bits are flipped. If that is the case
then the get_online_cpu(0 calls are unecessary because the bit flipping
simply cannot occur in these functions. There is nothing to serialize
against.

> If you can prepare the outgoing cpu work during the cpu offline phase and
> then just flip a bit in the stop machine part, then this might work, but
> anything else is just handwaving and proliferation of wet dreams.

Fine with that.

Thomas Gleixner

unread,

Feb 9, 2017, 6:43:19 AM2/9/17

to Christoph Lameter, Michal Hocko, Mel Gorman, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, linu...@kvack.org, LKML, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

On Wed, 8 Feb 2017, Christoph Lameter wrote:
> On Wed, 8 Feb 2017, Thomas Gleixner wrote:
>
> > There is a world outside yours. Hotplug is actually used frequently for
> > power purposes in some scenarios.
>
> The usual case does not inolve hotplug.

We do not care about your definition of "usual". The kernel serves _ALL_
use cases.

> > It will improve nothing. The stop machine context is extremly limited and
> > you cannot do complex things there at all. Not to talk about the inability
> > of taking a simple mutex which would immediately deadlock the machine.
>
> You do not need to do complex things. Basically flipping some cpu mask
> bits will do it. stop machine ensures that code is not
> executing on the processors when the bits are flipped. That will ensure
> that there is no need to do any get_online_cpu() nastiness in critical VM
> paths since we are guaranteed not to be executing them.

And how does that solve the problem at hand? Not at all:

CPU 0 CPU 1

for_each_online_cpu(cpu)
==> cpu = 1
stop_machine()
set_cpu_online(1, false)
queue_work(cpu1)

Thanks,

tglx

Christoph Lameter

unread,

Feb 9, 2017, 9:01:00 AM2/9/17

to Thomas Gleixner, Michal Hocko, Mel Gorman, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, linu...@kvack.org, LKML, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

On Thu, 9 Feb 2017, Thomas Gleixner wrote:

> And how does that solve the problem at hand? Not at all:
>
> CPU 0 CPU 1
>
> for_each_online_cpu(cpu)
> ==> cpu = 1
> stop_machine()
> set_cpu_online(1, false)
> queue_work(cpu1)
>
> Thanks,

Well thats not how I remember stop_machine does work. Doesnt it stop the
processing on all cpus otherwise its not a real usable stop.

The stop_machine would need to ensure that all cpus cease processing
before proceeding.

Thomas Gleixner

unread,

Feb 9, 2017, 9:53:35 AM2/9/17

to Christoph Lameter, Michal Hocko, Mel Gorman, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, linu...@kvack.org, LKML, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

Ok. I try again:

CPU 0 CPU 1
for_each_online_cpu(cpu)
==> cpu = 1
stop_machine()

Stops processing on all CPUs by preempting the current execution and
forcing them into a high priority busy loop with interrupts disabled.

context_switch()
stomper_thread()
busyloop()

set_cpu_online(1, false)

stop_machine end()
release busy looping CPUs

context_switch

Resumes operation at the preemption point. cpu is still 1

queue_work(cpu == 1)

It does exactly what you describe. It stops processing on all other cpus
until release, but that does not invalidate any data on those cpus.

It's been that way forever.

Thanks,

tglx

Christoph Lameter

unread,

Feb 9, 2017, 10:42:12 AM2/9/17

to Thomas Gleixner, Michal Hocko, Mel Gorman, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, linu...@kvack.org, LKML, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

On Thu, 9 Feb 2017, Thomas Gleixner wrote:

> > The stop_machine would need to ensure that all cpus cease processing
> > before proceeding.
>
> Ok. I try again:
>
> CPU 0 CPU 1
> for_each_online_cpu(cpu)
> ==> cpu = 1
> stop_machine()
>
> Stops processing on all CPUs by preempting the current execution and
> forcing them into a high priority busy loop with interrupts disabled.

Exactly that means we are outside of the sections marked with
get_online_cpous().

> It does exactly what you describe. It stops processing on all other cpus
> until release, but that does not invalidate any data on those cpus.

Why would it need to invalidate any data? The change of the cpu masks
would need to be done when the machine is stopped. This sounds exactly
like what we need and much of it is already there.

Lets get rid of get_online_cpus() etc.

Thomas Gleixner

unread,

Feb 9, 2017, 11:13:24 AM2/9/17

to Christoph Lameter, Michal Hocko, Mel Gorman, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, linu...@kvack.org, LKML, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

You are just not getting it, really.

The problem is that this for_each_online_cpu() is racy against a concurrent
hot unplug and therefor can queue stuff for a not longer online cpu. That's
what the mm folks tried to avoid by preventing a CPU hotplug operation
before entering that loop.

> Lets get rid of get_online_cpus() etc.

And that solves what?

Can you please start to understand the scope of the whole hotplug machinery
including the requirements for get_online_cpus() before you waste
everybodys time with your uninformed and halfbaken proposals?

Thanks,

tglx

Christoph Lameter

unread,

Feb 9, 2017, 12:22:53 PM2/9/17

to Thomas Gleixner, Michal Hocko, Mel Gorman, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, linu...@kvack.org, LKML, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

On Thu, 9 Feb 2017, Thomas Gleixner wrote:

> You are just not getting it, really.
>
> The problem is that this for_each_online_cpu() is racy against a concurrent
> hot unplug and therefor can queue stuff for a not longer online cpu. That's
> what the mm folks tried to avoid by preventing a CPU hotplug operation
> before entering that loop.

With a stop machine action it is NOT racy because the machine goes into a
special kernel state that guarantees that key operating system structures
are not touched. See mm/page_alloc.c's use of that characteristic to build
zonelists. Thus it cannot be executing for_each_online_cpu and related
tasks (unless one does not disable preempt .... but that is a given if a
spinlock has been taken)..

> > Lets get rid of get_online_cpus() etc.
>
> And that solves what?

It gets rid of future issues with serialization in paths were we need to
lock and still do for_each_online_cpu().

> Can you please start to understand the scope of the whole hotplug machinery
> including the requirements for get_online_cpus() before you waste
> everybodys time with your uninformed and halfbaken proposals?

Its an obvious solution to the issues that have arisen multiple times with
get_online_cpus() within the slab allocators. The hotplug machinery should
make things as easy as possible for other people and having these
get_online_cpus() everywhere does complicate things.

Thomas Gleixner

unread,

Feb 9, 2017, 12:40:55 PM2/9/17

to Christoph Lameter, Michal Hocko, Mel Gorman, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, linu...@kvack.org, LKML, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

On Thu, 9 Feb 2017, Christoph Lameter wrote:
> On Thu, 9 Feb 2017, Thomas Gleixner wrote:
>
> > You are just not getting it, really.
> >
> > The problem is that this for_each_online_cpu() is racy against a concurrent
> > hot unplug and therefor can queue stuff for a not longer online cpu. That's
> > what the mm folks tried to avoid by preventing a CPU hotplug operation
> > before entering that loop.
>
> With a stop machine action it is NOT racy because the machine goes into a
> special kernel state that guarantees that key operating system structures
> are not touched. See mm/page_alloc.c's use of that characteristic to build
> zonelists. Thus it cannot be executing for_each_online_cpu and related
> tasks (unless one does not disable preempt .... but that is a given if a
> spinlock has been taken)..

drain_all_pages() is called from preemptible context. So what are you
talking about again?

> > > Lets get rid of get_online_cpus() etc.
> >
> > And that solves what?
>
> It gets rid of future issues with serialization in paths were we need to
> lock and still do for_each_online_cpu().

There are code pathes which might sleep inside the loop so
get_online_cpus() is the only way to serialize against hotplug.

Just because the only tool you know is stop machine it does not make
everything an atomic context where it can be applied.

> > Can you please start to understand the scope of the whole hotplug machinery
> > including the requirements for get_online_cpus() before you waste
> > everybodys time with your uninformed and halfbaken proposals?
>
> Its an obvious solution to the issues that have arisen multiple times with
> get_online_cpus() within the slab allocators. The hotplug machinery should
> make things as easy as possible for other people and having these
> get_online_cpus() everywhere does complicate things.

It's no obvious solution to everything. It's context dependend and people
have to think hard how to solve their problem within the context they are
dealing with.

Your 'get rid of get_online_cpus()' mantra does make all of this magically
simple. Relying on the fact, that the CPU online bit is cleared via
stomp_machine(), which is a horrible operation in various aspects, only
applies to a very small subset of problems. You can repeat your mantra
another thousand times and that will not make the way larger set of
problems magically disappear.

Thanks,

tglx

Michal Hocko

unread,

Feb 9, 2017, 2:15:51 PM2/9/17

to Christoph Lameter, Thomas Gleixner, Mel Gorman, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, linu...@kvack.org, LKML, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

On Thu 09-02-17 11:22:49, Cristopher Lameter wrote:
> On Thu, 9 Feb 2017, Thomas Gleixner wrote:
>
> > You are just not getting it, really.
> >
> > The problem is that this for_each_online_cpu() is racy against a concurrent
> > hot unplug and therefor can queue stuff for a not longer online cpu. That's
> > what the mm folks tried to avoid by preventing a CPU hotplug operation
> > before entering that loop.
>
> With a stop machine action it is NOT racy because the machine goes into a
> special kernel state that guarantees that key operating system structures
> are not touched. See mm/page_alloc.c's use of that characteristic to build
> zonelists. Thus it cannot be executing for_each_online_cpu and related
> tasks (unless one does not disable preempt .... but that is a given if a
> spinlock has been taken)..

Christoph, you are completely ignoring the reality and the code. There
is no need for stop_machine nor it is helping anything. As the matter
of fact there is a synchronization with the cpu hotplug needed if you
want to make a per-cpu specific operations. get_online_cpus is the
most straightforward and heavy weight way to do this synchronization
but not the only one. As the patch [1] describes we do not really need
get_online_cpus in drain_all_pages because we can do _better_. But
this is not in any way a generic thing applicable to other code paths.

If you disagree then you are free to post patches but hand waving you
are doing here is just wasting everybody's time. So please cut it here
unless you have specific proposals to improve the current situation.

Thanks!

[1] http://lkml.kernel.org/r/20170207201950...@kernel.org

Christoph Lameter

unread,

Feb 10, 2017, 12:58:55 PM2/10/17

to Michal Hocko, Thomas Gleixner, Mel Gorman, Vlastimil Babka, Dmitry Vyukov, Tejun Heo, linu...@kvack.org, LKML, Ingo Molnar, Peter Zijlstra, syzkaller, Andrew Morton

On Thu, 9 Feb 2017, Michal Hocko wrote:

> Christoph, you are completely ignoring the reality and the code. There
> is no need for stop_machine nor it is helping anything. As the matter
> of fact there is a synchronization with the cpu hotplug needed if you
> want to make a per-cpu specific operations. get_online_cpus is the
> most straightforward and heavy weight way to do this synchronization
> but not the only one. As the patch [1] describes we do not really need
> get_online_cpus in drain_all_pages because we can do _better_. But
> this is not in any way a generic thing applicable to other code paths.
>
> If you disagree then you are free to post patches but hand waving you
> are doing here is just wasting everybody's time. So please cut it here
> unless you have specific proposals to improve the current situation.

I am fine with the particular solution here for this particular problem.

My problem is the general way of having to synchronize via
get_online_cpus() because of cpu hotplug operations.

Reply all

Reply to author

Forward

mm: deadlock between get_online_cpus/pcpu_alloc

Dmitry Vyukov

Vlastimil Babka

Dmitry Vyukov

Dmitry Vyukov

Mel Gorman

Michal Hocko

Michal Hocko

Vlastimil Babka

Mel Gorman

Mel Gorman

Vlastimil Babka

Michal Hocko

Michal Hocko

Mel Gorman

Michal Hocko

Mel Gorman

Mel Gorman

Tetsuo Handa

Mel Gorman

Michal Hocko

Vlastimil Babka

Michal Hocko

Michal Hocko

Vlastimil Babka

Michal Hocko

Mel Gorman

Michal Hocko

Vlastimil Babka

Mel Gorman

Michal Hocko

Michal Hocko

Mel Gorman

Michal Hocko

Christoph Lameter

Tejun Heo

Michal Hocko

Thomas Gleixner

Thomas Gleixner

Michal Hocko

Thomas Gleixner

Michal Hocko

Mel Gorman

Thomas Gleixner

Mel Gorman

Peter Zijlstra

Christoph Lameter

Christoph Lameter

Michal Hocko

Christoph Lameter

Thomas Gleixner

Christoph Lameter

Thomas Gleixner

Christoph Lameter

Thomas Gleixner

Christoph Lameter

Thomas Gleixner

Christoph Lameter

Thomas Gleixner

Michal Hocko

Christoph Lameter