[syzbot] INFO: task hung in io_wqe_worker


syzbot

Oct 21, 2021, 5:10:27 PM
to asml.s...@gmail.com, ax...@kernel.dk, io-u...@vger.kernel.org, linux-...@vger.kernel.org, syzkall...@googlegroups.com
Hello,

syzbot found the following issue on:

HEAD commit: d999ade1cc86 Merge tag 'perf-tools-fixes-for-v5.15-2021-10..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=136f87d0b00000
kernel config: https://syzkaller.appspot.com/x/.config?x=bab9d35f204746a7
dashboard link: https://syzkaller.appspot.com/bug?extid=27d62ee6f256b186883e
compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=10d3f7ccb00000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=15d3600cb00000

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+27d62e...@syzkaller.appspotmail.com

INFO: task iou-wrk-6609:6612 blocked for more than 143 seconds.
Not tainted 5.15.0-rc5-syzkaller #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:iou-wrk-6609 state:D stack:27944 pid: 6612 ppid: 6526 flags:0x00004006
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xb44/0x5960 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
schedule_timeout+0x1db/0x2a0 kernel/time/timer.c:1857
do_wait_for_common kernel/sched/completion.c:85 [inline]
__wait_for_common kernel/sched/completion.c:106 [inline]
wait_for_common kernel/sched/completion.c:117 [inline]
wait_for_completion+0x176/0x280 kernel/sched/completion.c:138
io_worker_exit fs/io-wq.c:183 [inline]
io_wqe_worker+0x66d/0xc40 fs/io-wq.c:597
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295

Showing all locks held in the system:
1 lock held by khungtaskd/27:
#0: ffffffff8b981ae0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x53/0x260 kernel/locking/lockdep.c:6446

=============================================

NMI backtrace for cpu 0
CPU: 0 PID: 27 Comm: khungtaskd Not tainted 5.15.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
nmi_cpu_backtrace.cold+0x47/0x144 lib/nmi_backtrace.c:105
nmi_trigger_cpumask_backtrace+0x1ae/0x220 lib/nmi_backtrace.c:62
trigger_all_cpu_backtrace include/linux/nmi.h:146 [inline]
check_hung_uninterruptible_tasks kernel/hung_task.c:210 [inline]
watchdog+0xc1d/0xf50 kernel/hung_task.c:295
kthread+0x3e5/0x4d0 kernel/kthread.c:319
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
Sending NMI from CPU 0 to CPUs 1:
NMI backtrace for cpu 1
CPU: 1 PID: 1414 Comm: kworker/u4:5 Not tainted 5.15.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: events_unbound toggle_allocation_gate
RIP: 0010:bytes_is_nonzero mm/kasan/generic.c:85 [inline]
RIP: 0010:memory_is_nonzero mm/kasan/generic.c:102 [inline]
RIP: 0010:memory_is_poisoned_n mm/kasan/generic.c:128 [inline]
RIP: 0010:memory_is_poisoned mm/kasan/generic.c:159 [inline]
RIP: 0010:check_region_inline mm/kasan/generic.c:180 [inline]
RIP: 0010:kasan_check_range+0xde/0x180 mm/kasan/generic.c:189
Code: 74 f2 48 89 c2 b8 01 00 00 00 48 85 d2 75 56 5b 5d 41 5c c3 48 85 d2 74 5e 48 01 ea eb 09 48 83 c0 01 48 39 d0 74 50 80 38 00 <74> f2 eb d4 41 bc 08 00 00 00 48 89 ea 45 29 dc 4d 8d 1c 2c eb 0c
RSP: 0018:ffffc90005aa7988 EFLAGS: 00000046
RAX: ffffed10021cd084 RBX: ffffed10021cd085 RCX: ffffffff81348c59
RDX: ffffed10021cd085 RSI: 0000000000000008 RDI: ffff888010e68420
RBP: ffffed10021cd084 R08: 0000000000000000 R09: ffff888010e68427
R10: ffffed10021cd084 R11: 000000000000003f R12: ffffffff8baabbe0
R13: ffff888010e68420 R14: 0000000000000000 R15: ffff88801dfeda50
FS: 0000000000000000(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f93e8906000 CR3: 000000000b68e000 CR4: 0000000000350ee0
Call Trace:
instrument_atomic_read include/linux/instrumented.h:71 [inline]
atomic64_read include/linux/atomic/atomic-instrumented.h:605 [inline]
switch_mm_irqs_off+0x1e9/0xa10 arch/x86/mm/tlb.c:615
use_temporary_mm arch/x86/kernel/alternative.c:741 [inline]
__text_poke+0x447/0x8c0 arch/x86/kernel/alternative.c:838
text_poke_bp_batch+0x3d7/0x560 arch/x86/kernel/alternative.c:1178
text_poke_flush arch/x86/kernel/alternative.c:1268 [inline]
text_poke_flush arch/x86/kernel/alternative.c:1265 [inline]
text_poke_finish+0x16/0x30 arch/x86/kernel/alternative.c:1275
arch_jump_label_transform_apply+0x13/0x20 arch/x86/kernel/jump_label.c:146
jump_label_update+0x1d5/0x430 kernel/jump_label.c:830
static_key_enable_cpuslocked+0x1b1/0x260 kernel/jump_label.c:177
static_key_enable+0x16/0x20 kernel/jump_label.c:190
toggle_allocation_gate mm/kfence/core.c:626 [inline]
toggle_allocation_gate+0x100/0x390 mm/kfence/core.c:618
process_one_work+0x9bf/0x16b0 kernel/workqueue.c:2297
worker_thread+0x658/0x11f0 kernel/workqueue.c:2444
kthread+0x3e5/0x4d0 kernel/kthread.c:319
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
----------------
Code disassembly (best guess):
0: 74 f2 je 0xfffffff4
2: 48 89 c2 mov %rax,%rdx
5: b8 01 00 00 00 mov $0x1,%eax
a: 48 85 d2 test %rdx,%rdx
d: 75 56 jne 0x65
f: 5b pop %rbx
10: 5d pop %rbp
11: 41 5c pop %r12
13: c3 retq
14: 48 85 d2 test %rdx,%rdx
17: 74 5e je 0x77
19: 48 01 ea add %rbp,%rdx
1c: eb 09 jmp 0x27
1e: 48 83 c0 01 add $0x1,%rax
22: 48 39 d0 cmp %rdx,%rax
25: 74 50 je 0x77
27: 80 38 00 cmpb $0x0,(%rax)
* 2a: 74 f2 je 0x1e <-- trapping instruction
2c: eb d4 jmp 0x2
2e: 41 bc 08 00 00 00 mov $0x8,%r12d
34: 48 89 ea mov %rbp,%rdx
37: 45 29 dc sub %r11d,%r12d
3a: 4d 8d 1c 2c lea (%r12,%rbp,1),%r11
3e: eb 0c jmp 0x4c


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzk...@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
syzbot can test patches for this issue, for details see:
https://goo.gl/tpsmEJ#testing-patches

Pavel Begunkov

Oct 21, 2021, 7:48:15 PM
to syzbot, ax...@kernel.dk, io-u...@vger.kernel.org, linux-...@vger.kernel.org, syzkall...@googlegroups.com
On 10/21/21 22:10, syzbot wrote:
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit: d999ade1cc86 Merge tag 'perf-tools-fixes-for-v5.15-2021-10..
> git tree: upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=136f87d0b00000
> kernel config: https://syzkaller.appspot.com/x/.config?x=bab9d35f204746a7
> dashboard link: https://syzkaller.appspot.com/bug?extid=27d62ee6f256b186883e
> compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=10d3f7ccb00000
> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=15d3600cb00000
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+27d62e...@syzkaller.appspotmail.com

#syz test: git://git.kernel.dk/linux-block io_uring-5.15


--
Pavel Begunkov

syzbot

Oct 22, 2021, 12:38:10 AM
to asml.s...@gmail.com, ax...@kernel.dk, io-u...@vger.kernel.org, linux-...@vger.kernel.org, syzkall...@googlegroups.com
Hello,

syzbot has tested the proposed patch but the reproducer is still triggering an issue:
INFO: task hung in io_wqe_worker

INFO: task iou-wrk-9392:9401 blocked for more than 143 seconds.
Not tainted 5.15.0-rc2-syzkaller #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:iou-wrk-9392 state:D stack:27952 pid: 9401 ppid: 7038 flags:0x00004004
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xb44/0x5960 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
schedule_timeout+0x1db/0x2a0 kernel/time/timer.c:1857
do_wait_for_common kernel/sched/completion.c:85 [inline]
__wait_for_common kernel/sched/completion.c:106 [inline]
wait_for_common kernel/sched/completion.c:117 [inline]
wait_for_completion+0x176/0x280 kernel/sched/completion.c:138
io_worker_exit fs/io-wq.c:183 [inline]
io_wqe_worker+0x66d/0xc40 fs/io-wq.c:597
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295

Showing all locks held in the system:
1 lock held by khungtaskd/27:
#0: ffffffff8b981ae0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x53/0x260 kernel/locking/lockdep.c:6446
1 lock held by cron/6230:
#0: ffff8880b9c31a58 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x2b/0x120 kernel/sched/core.c:474
1 lock held by in:imklog/6237:
#0: ffff88801db6ad70 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0xe9/0x100 fs/file.c:990

=============================================

NMI backtrace for cpu 1
CPU: 1 PID: 27 Comm: khungtaskd Not tainted 5.15.0-rc2-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
nmi_cpu_backtrace.cold+0x47/0x144 lib/nmi_backtrace.c:105
nmi_trigger_cpumask_backtrace+0x1ae/0x220 lib/nmi_backtrace.c:62
trigger_all_cpu_backtrace include/linux/nmi.h:146 [inline]
check_hung_uninterruptible_tasks kernel/hung_task.c:210 [inline]
watchdog+0xc1d/0xf50 kernel/hung_task.c:295
kthread+0x3e5/0x4d0 kernel/kthread.c:319
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
Sending NMI from CPU 1 to CPUs 0:
NMI backtrace for cpu 0
CPU: 0 PID: 6872 Comm: kworker/0:4 Not tainted 5.15.0-rc2-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: events nsim_dev_trap_report_work
RIP: 0010:set_canary_byte mm/kfence/core.c:214 [inline]
RIP: 0010:for_each_canary mm/kfence/core.c:249 [inline]
RIP: 0010:kfence_guarded_alloc mm/kfence/core.c:321 [inline]
RIP: 0010:__kfence_alloc+0x635/0xca0 mm/kfence/core.c:779
Code: 71 8a b8 ff 48 8b 6c 24 10 48 be 00 00 00 00 00 fc ff df 48 c1 ed 03 48 01 f5 4d 39 f7 73 5c e8 81 84 b8 ff 4c 89 f8 45 89 fe <4c> 89 fa 48 c1 e8 03 41 83 e6 07 83 e2 07 48 b9 00 00 00 00 00 fc
RSP: 0018:ffffc9000548fb48 EFLAGS: 00000093
RAX: ffff88823bce0d4b RBX: ffffffff9028ec08 RCX: 0000000000000000
RDX: ffff888079f7d580 RSI: ffffffff81be60df RDI: 0000000000000003
RBP: fffffbfff2051d8e R08: ffff88823bce0d4b R09: ffffffff8eef500f
R10: ffffffff81be6131 R11: 0000000000000001 R12: ffff8881441fa000
R13: 00000000000000e8 R14: 000000003bce0d4b R15: ffff88823bce0d4b
FS: 0000000000000000(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007efccb07f000 CR3: 000000000b68e000 CR4: 0000000000350ef0
Call Trace:
kfence_alloc include/linux/kfence.h:124 [inline]
slab_alloc_node mm/slub.c:3124 [inline]
kmem_cache_alloc_node+0x213/0x3d0 mm/slub.c:3242
__alloc_skb+0x20b/0x340 net/core/skbuff.c:414
alloc_skb include/linux/skbuff.h:1116 [inline]
nsim_dev_trap_skb_build drivers/net/netdevsim/dev.c:664 [inline]
nsim_dev_trap_report drivers/net/netdevsim/dev.c:721 [inline]
nsim_dev_trap_report_work+0x2ac/0xbd0 drivers/net/netdevsim/dev.c:762
process_one_work+0x9bf/0x16b0 kernel/workqueue.c:2297
worker_thread+0x658/0x11f0 kernel/workqueue.c:2444
kthread+0x3e5/0x4d0 kernel/kthread.c:319
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
----------------
Code disassembly (best guess):
0: 71 8a jno 0xffffff8c
2: b8 ff 48 8b 6c mov $0x6c8b48ff,%eax
7: 24 10 and $0x10,%al
9: 48 be 00 00 00 00 00 movabs $0xdffffc0000000000,%rsi
10: fc ff df
13: 48 c1 ed 03 shr $0x3,%rbp
17: 48 01 f5 add %rsi,%rbp
1a: 4d 39 f7 cmp %r14,%r15
1d: 73 5c jae 0x7b
1f: e8 81 84 b8 ff callq 0xffb884a5
24: 4c 89 f8 mov %r15,%rax
27: 45 89 fe mov %r15d,%r14d
* 2a: 4c 89 fa mov %r15,%rdx <-- trapping instruction
2d: 48 c1 e8 03 shr $0x3,%rax
31: 41 83 e6 07 and $0x7,%r14d
35: 83 e2 07 and $0x7,%edx
38: 48 rex.W
39: b9 00 00 00 00 mov $0x0,%ecx
3e: 00 fc add %bh,%ah


Tested on:

commit: b22fa62a io_uring: apply worker limits to previous users
git tree: git://git.kernel.dk/linux-block io_uring-5.15
console output: https://syzkaller.appspot.com/x/log.txt?x=16a3172cb00000
kernel config: https://syzkaller.appspot.com/x/.config?x=cf1d1005f4fd6ccb

Pavel Begunkov

Oct 22, 2021, 9:49:37 AM
to syzbot, ax...@kernel.dk, io-u...@vger.kernel.org, linux-...@vger.kernel.org, syzkall...@googlegroups.com
On 10/22/21 05:38, syzbot wrote:
> Hello,
>
> syzbot has tested the proposed patch but the reproducer is still triggering an issue:
> INFO: task hung in io_wqe_worker
>
> INFO: task iou-wrk-9392:9401 blocked for more than 143 seconds.
> Not tainted 5.15.0-rc2-syzkaller #0
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:iou-wrk-9392 state:D stack:27952 pid: 9401 ppid: 7038 flags:0x00004004
> Call Trace:
> context_switch kernel/sched/core.c:4940 [inline]
> __schedule+0xb44/0x5960 kernel/sched/core.c:6287
> schedule+0xd3/0x270 kernel/sched/core.c:6366
> schedule_timeout+0x1db/0x2a0 kernel/time/timer.c:1857
> do_wait_for_common kernel/sched/completion.c:85 [inline]
> __wait_for_common kernel/sched/completion.c:106 [inline]
> wait_for_common kernel/sched/completion.c:117 [inline]
> wait_for_completion+0x176/0x280 kernel/sched/completion.c:138
> io_worker_exit fs/io-wq.c:183 [inline]
> io_wqe_worker+0x66d/0xc40 fs/io-wq.c:597
> ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295

It's easily reproducible. The worker is stuck in

static void io_worker_exit(struct io_worker *worker)
{
	...
	wait_for_completion(&worker->ref_done);
	...
}

The reference belongs to a create_worker_cb() task_work item. That item is
expected to be either executed or cancelled by io_wq_exit_workers(), but the
owner task never gets to __io_uring_cancel() (called from do_exit()) and so
never reaches io_wq_exit_workers().
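To make the dependency concrete, here is a minimal userspace sketch in C with POSIX threads. It is not the io-wq code: the `completion` struct, `worker_release()`, `simulate_worker_exit()` and the 200 ms timeout are hypothetical stand-ins, assuming io-wq's pattern of a worker reference count whose last drop fires `ref_done`. The worker drops its own reference and waits; a second reference stands for the pending create_worker_cb() item, and only the owner can drop it by running or cancelling that item. The timeout exists only so the demo reports the hang instead of hanging; the kernel's wait_for_completion() waits forever.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical userspace stand-in for the kernel's struct completion. */
struct completion {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool done;
};

static void init_completion(struct completion *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = false;
}

static void complete(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = true;
	pthread_cond_broadcast(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

/* Like wait_for_completion(), but gives up after timeout_ms so the demo can
 * report a hang. Returns true if the completion fired in time. */
static bool wait_for_completion_ms(struct completion *c, long timeout_ms)
{
	struct timespec ts;
	int rc = 0;

	clock_gettime(CLOCK_REALTIME, &ts);
	ts.tv_sec += timeout_ms / 1000;
	ts.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (ts.tv_nsec >= 1000000000L) {
		ts.tv_sec += 1;
		ts.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&c->lock);
	while (!c->done && rc != ETIMEDOUT)
		rc = pthread_cond_timedwait(&c->cond, &c->lock, &ts);
	bool done = c->done;
	pthread_mutex_unlock(&c->lock);
	return done;
}

struct worker {
	atomic_int ref;             /* references keeping the worker alive */
	struct completion ref_done; /* fired when ref drops to zero */
	bool exited;                /* did the worker's final wait succeed? */
};

/* Drop one reference; dropping the last one fires ref_done. */
static void worker_release(struct worker *w)
{
	int old = atomic_fetch_sub(&w->ref, 1);

	assert(old > 0);
	if (old == 1)
		complete(&w->ref_done);
}

/* Worker thread, io_worker_exit()-like: drop our own ref, wait for the rest. */
static void *worker_fn(void *arg)
{
	struct worker *w = arg;

	worker_release(w);
	w->exited = wait_for_completion_ms(&w->ref_done, 200);
	return NULL;
}

/* Returns true if the worker managed to finish its exit wait. */
static bool simulate_worker_exit(bool owner_handles_task_work)
{
	struct worker w;
	pthread_t t;

	/* One ref for the worker itself, one held by a pending task_work item. */
	atomic_init(&w.ref, 2);
	init_completion(&w.ref_done);
	w.exited = false;

	pthread_create(&t, NULL, worker_fn, &w);
	if (owner_handles_task_work)
		worker_release(&w); /* owner runs or cancels the item, dropping its ref */
	pthread_join(t, NULL);

	pthread_cond_destroy(&w.ref_done.cond);
	pthread_mutex_destroy(&w.ref_done.lock);
	return w.exited;
}
```

Built with `-pthread`, `simulate_worker_exit(true)` returns true, while `simulate_worker_exit(false)` returns false once the timeout expires, which corresponds to the iou-wrk thread stuck in the report.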

Looking at the owner task with cat /proc/<pid>/stack:

[<0>] do_coredump+0x1d0/0x10e0
[<0>] get_signal+0x4a3/0x960
[<0>] arch_do_signal_or_restart+0xc3/0x6d0
[<0>] exit_to_user_mode_prepare+0x10e/0x190
[<0>] irqentry_exit_to_user_mode+0x9/0x20
[<0>] irqentry_exit+0x36/0x40
[<0>] exc_page_fault+0x95/0x190
[<0>] asm_exc_page_fault+0x1e/0x30

(gdb) l *(do_coredump+0x1d0-5)
0xffffffff81343ccb is in do_coredump (fs/coredump.c:469).
464
465		if (core_waiters > 0) {
466			struct core_thread *ptr;
467
468			freezer_do_not_count();
469			wait_for_completion(&core_state->startup);
470			freezer_count();
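The `-5` in the gdb command is there because stack-trace entries show return addresses, which point just past the call instruction. On x86-64 a direct `call rel32` is 5 bytes long, so subtracting 5 lands on the call instruction itself and gdb reports the line that made the call. A small sketch of that arithmetic follows; `call_site_offset()` is a hypothetical helper, and it assumes a direct 5-byte call (indirect calls have other lengths, where subtracting 1 still lands inside any call instruction).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Length of a direct x86-64 `call rel32` (opcode E8 plus a 4-byte displacement). */
#define CALL_REL32_LEN 5

/*
 * Hypothetical helper: parse a stack-trace entry "func+0xOFF/0xSIZE" (without the
 * leading "[<0>] ") and compute the offset of the call instruction, assuming a
 * direct 5-byte call. Returns 0 on success, -1 if the entry does not parse.
 */
static int call_site_offset(const char *entry, char *func, size_t func_len,
			    unsigned long *off)
{
	const char *plus;
	unsigned long ret_off, size;
	size_t name_len;

	assert(entry && func && off);
	plus = strchr(entry, '+');
	if (!plus)
		return -1;
	name_len = (size_t)(plus - entry);
	if (name_len == 0 || name_len >= func_len)
		return -1;
	/* OFF is the return address offset; SIZE (function length) is not needed. */
	if (sscanf(plus + 1, "0x%lx/0x%lx", &ret_off, &size) != 2 ||
	    ret_off < CALL_REL32_LEN)
		return -1;

	memcpy(func, entry, name_len);
	func[name_len] = '\0';
	/* Step back over the call so the address points at the call site. */
	*off = ret_off - CALL_REL32_LEN;
	return 0;
}
```

For the entry `do_coredump+0x1d0/0x10e0` from the owner's stack it yields `do_coredump` and offset `0x1cb`, the location gdb resolved to fs/coredump.c:469.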

I can't say anything more at the moment, as I'm not familiar with the coredump code.

--
Pavel Begunkov

Pavel Begunkov

Oct 22, 2021, 9:57:04 AM
to syzbot, ax...@kernel.dk, io-u...@vger.kernel.org, linux-...@vger.kernel.org, syzkall...@googlegroups.com
A simple hack that lets task_work run from coredump_wait() works around
the problem:


diff --git a/fs/coredump.c b/fs/coredump.c
index 3224dee44d30..f6f9dfb02296 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -466,7 +466,8 @@ static int coredump_wait(int exit_code, struct core_state *core_state)
 		struct core_thread *ptr;
 
 		freezer_do_not_count();
-		wait_for_completion(&core_state->startup);
+		while (wait_for_completion_interruptible(&core_state->startup))
+			tracehook_notify_signal();
 		freezer_count();
 		/*
 		 * Wait for all the threads to become inactive, so that


--
Pavel Begunkov

Pavel Begunkov

Oct 28, 2021, 4:32:22 PM
to syzbot, ax...@kernel.dk, io-u...@vger.kernel.org, linux-...@vger.kernel.org, syzkall...@googlegroups.com
On 10/22/21 14:57, Pavel Begunkov wrote:
> On 10/22/21 14:49, Pavel Begunkov wrote:
>> On 10/22/21 05:38, syzbot wrote:
>>> Hello,
>>>
>>> syzbot has tested the proposed patch but the reproducer is still triggering an issue:
>>> INFO: task hung in io_wqe_worker
>>>
>>> INFO: task iou-wrk-9392:9401 blocked for more than 143 seconds.
>>>        Not tainted 5.15.0-rc2-syzkaller #0
>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> task:iou-wrk-9392    state:D stack:27952 pid: 9401 ppid:  7038 flags:0x00004004
>>> Call Trace:
>>>   context_switch kernel/sched/core.c:4940 [inline]
>>>   __schedule+0xb44/0x5960 kernel/sched/core.c:6287
>>>   schedule+0xd3/0x270 kernel/sched/core.c:6366
>>>   schedule_timeout+0x1db/0x2a0 kernel/time/timer.c:1857
>>>   do_wait_for_common kernel/sched/completion.c:85 [inline]
>>>   __wait_for_common kernel/sched/completion.c:106 [inline]
>>>   wait_for_common kernel/sched/completion.c:117 [inline]
>>>   wait_for_completion+0x176/0x280 kernel/sched/completion.c:138
>>>   io_worker_exit fs/io-wq.c:183 [inline]
>>>   io_wqe_worker+0x66d/0xc40 fs/io-wq.c:597
>>>   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295

#syz test: https://github.com/isilence/linux.git syz_coredump

syzbot

Oct 28, 2021, 6:35:12 PM
to asml.s...@gmail.com, ax...@kernel.dk, io-u...@vger.kernel.org, linux-...@vger.kernel.org, syzkall...@googlegroups.com
Hello,

syzbot has tested the proposed patch and the reproducer did not trigger any issue:

Reported-and-tested-by: syzbot+27d62e...@syzkaller.appspotmail.com

Tested on:

commit: 5983fb88 io-wq: remove worker to owner dependency
git tree: https://github.com/isilence/linux.git syz_coredump
kernel config: https://syzkaller.appspot.com/x/.config?x=1f7f46d98a0da80e
dashboard link: https://syzkaller.appspot.com/bug?extid=27d62ee6f256b186883e
compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2

Note: testing is done by a robot and is best-effort only.