BUG: workqueue lockup (2)


syzbot

Dec 3, 2017, 9:31:04 AM
to gre...@linuxfoundation.org, kste...@linuxfoundation.org, linux-...@vger.kernel.org, linu...@kvack.org, pombr...@nexb.com, syzkall...@googlegroups.com, tg...@linutronix.de
Hello,

syzkaller hit the following crash on
2db767d9889cef087149a5eaa35c1497671fa40f
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
compiler: gcc (GCC) 7.1.1 20170620
.config is attached
Raw console output is attached.

Unfortunately, I don't have any reproducer for this bug yet.


BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 48s!
BUG: workqueue lockup - pool cpus=0-1 flags=0x4 nice=0 stuck for 47s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=4/256
pending: perf_sched_delayed, vmstat_shepherd,
jump_label_update_timeout, cache_reap
workqueue events_power_efficient: flags=0x80
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=4/256
pending: neigh_periodic_work, neigh_periodic_work, do_cache_clean,
reg_check_chans_work
workqueue mm_percpu_wq: flags=0x8
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
pending: vmstat_update
workqueue writeback: flags=0x4e
pwq 4: cpus=0-1 flags=0x4 nice=0 active=1/256
in-flight: 3401:wb_workfn
workqueue kblockd: flags=0x18
pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=1/256
pending: blk_mq_timeout_work
pool 4: cpus=0-1 flags=0x4 nice=0 hung=0s workers=11 idle: 3423 4249 92 21
3549 34 4803 5 4243 3414
audit: type=1326 audit(1512291140.021:615): auid=4294967295 uid=0 gid=0
ses=4294967295 subj=kernel pid=7002 comm="syz-executor7"
exe="/root/syz-executor7" sig=0 arch=c000003e syscall=202 compat=0
ip=0x4529d9 code=0x7ffc0000
audit: type=1326 audit(1512291140.044:616): auid=4294967295 uid=0 gid=0
ses=4294967295 subj=kernel pid=7002 comm="syz-executor7"
exe="/root/syz-executor7" sig=0 arch=c000003e syscall=202 compat=0
ip=0x4529d9 code=0x7ffc0000
audit: type=1326 audit(1512291140.045:617): auid=4294967295 uid=0 gid=0
ses=4294967295 subj=kernel pid=7002 comm="syz-executor7"
exe="/root/syz-executor7" sig=0 arch=c000003e syscall=55 compat=0
ip=0x4529d9 code=0x7ffc0000
audit: type=1326 audit(1512291140.045:618): auid=4294967295 uid=0 gid=0
ses=4294967295 subj=kernel pid=7002 comm="syz-executor7"
exe="/root/syz-executor7" sig=0 arch=c000003e syscall=202 compat=0
ip=0x4529d9 code=0x7ffc0000
audit: type=1326 audit(1512291140.045:619): auid=4294967295 uid=0 gid=0
ses=4294967295 subj=kernel pid=7002 comm="syz-executor7"
exe="/root/syz-executor7" sig=0 arch=c000003e syscall=202 compat=0
ip=0x4529d9 code=0x7ffc0000
audit: type=1326 audit(1512291140.047:620): auid=4294967295 uid=0 gid=0
ses=4294967295 subj=kernel pid=7002 comm="syz-executor7"
exe="/root/syz-executor7" sig=0 arch=c000003e syscall=257 compat=0
ip=0x4529d9 code=0x7ffc0000
audit: type=1326 audit(1512291140.047:621): auid=4294967295 uid=0 gid=0
ses=4294967295 subj=kernel pid=7002 comm="syz-executor7"
exe="/root/syz-executor7" sig=0 arch=c000003e syscall=202 compat=0
ip=0x4529d9 code=0x7ffc0000
audit: type=1326 audit(1512291140.047:622): auid=4294967295 uid=0 gid=0
ses=4294967295 subj=kernel pid=7002 comm="syz-executor7"
exe="/root/syz-executor7" sig=0 arch=c000003e syscall=202 compat=0
ip=0x4529d9 code=0x7ffc0000
audit: type=1326 audit(1512291140.048:623): auid=4294967295 uid=0 gid=0
ses=4294967295 subj=kernel pid=7002 comm="syz-executor7"
exe="/root/syz-executor7" sig=0 arch=c000003e syscall=16 compat=0
ip=0x4529d9 code=0x7ffc0000
audit: type=1326 audit(1512291140.049:624): auid=4294967295 uid=0 gid=0
ses=4294967295 subj=kernel pid=7002 comm="syz-executor7"
exe="/root/syz-executor7" sig=0 arch=c000003e syscall=202 compat=0
ip=0x4529d9 code=0x7ffc0000
netlink: 2 bytes leftover after parsing attributes in process
`syz-executor2'.
netlink: 2 bytes leftover after parsing attributes in process
`syz-executor2'.
netlink: 17 bytes leftover after parsing attributes in process
`syz-executor7'.
device gre0 entered promiscuous mode
device gre0 entered promiscuous mode
could not allocate digest TFM handle [vmnet1%
could not allocate digest TFM handle [vmnet1%
SELinux: unrecognized netlink message: protocol=0 nlmsg_type=7
sclass=netlink_route_socket pig=7627 comm=syz-executor3
sock: sock_set_timeout: `syz-executor6' (pid 7625) tries to set negative
timeout
Can not set IPV6_FL_F_REFLECT if flowlabel_consistency sysctl is enable
SELinux: unrecognized netlink message: protocol=0 nlmsg_type=0
sclass=netlink_route_socket pig=7648 comm=syz-executor3
sock: sock_set_timeout: `syz-executor6' (pid 7625) tries to set negative
timeout
Can not set IPV6_FL_F_REFLECT if flowlabel_consistency sysctl is enable
SELinux: unrecognized netlink message: protocol=0 nlmsg_type=7
sclass=netlink_route_socket pig=7648 comm=syz-executor3
SELinux: unrecognized netlink message: protocol=0 nlmsg_type=0
sclass=netlink_route_socket pig=7627 comm=syz-executor3
Can not set IPV6_FL_F_REFLECT if flowlabel_consistency sysctl is enable
Can not set IPV6_FL_F_REFLECT if flowlabel_consistency sysctl is enable
netlink: 4 bytes leftover after parsing attributes in process
`syz-executor3'.
netlink: 4 bytes leftover after parsing attributes in process
`syz-executor3'.
netlink: 1 bytes leftover after parsing attributes in process
`syz-executor3'.
QAT: Invalid ioctl
netlink: 1 bytes leftover after parsing attributes in process
`syz-executor3'.
QAT: Invalid ioctl
FAULT_INJECTION: forcing a failure.
name failslab, interval 1, probability 0, space 0, times 1
CPU: 1 PID: 7838 Comm: syz-executor4 Not tainted 4.15.0-rc1+ #205
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:17 [inline]
dump_stack+0x194/0x257 lib/dump_stack.c:53
fail_dump lib/fault-inject.c:51 [inline]
should_fail+0x8c0/0xa40 lib/fault-inject.c:149
should_failslab+0xec/0x120 mm/failslab.c:32
slab_pre_alloc_hook mm/slab.h:421 [inline]
slab_alloc mm/slab.c:3371 [inline]
__do_kmalloc mm/slab.c:3709 [inline]
__kmalloc_track_caller+0x5f/0x760 mm/slab.c:3726
memdup_user+0x2c/0x90 mm/util.c:164
msr_io+0xec/0x3b0 arch/x86/kvm/x86.c:2650
kvm_arch_vcpu_ioctl+0x31d/0x4710 arch/x86/kvm/x86.c:3566
kvm_vcpu_ioctl+0x240/0x1010 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2726
vfs_ioctl fs/ioctl.c:46 [inline]
do_vfs_ioctl+0x1b1/0x1530 fs/ioctl.c:686
SYSC_ioctl fs/ioctl.c:701 [inline]
SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
entry_SYSCALL_64_fastpath+0x1f/0x96
RIP: 0033:0x4529d9
RSP: 002b:00007fd7722d4c58 EFLAGS: 00000212 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fd7722d4aa0 RCX: 00000000004529d9
RDX: 0000000020002000 RSI: 000000004008ae89 RDI: 0000000000000016
RBP: 00007fd7722d4a90 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000212 R12: 00000000004b759b
R13: 00007fd7722d4bc8 R14: 00000000004b759b R15: 0000000000000000
device lo left promiscuous mode
device lo entered promiscuous mode
device lo left promiscuous mode
kvm_hv_set_msr: 127 callbacks suppressed
kvm [8010]: vcpu0, guest rIP: 0x9112 Hyper-V uhandled wrmsr: 0x4000008e
data 0x47
kvm [8010]: vcpu0, guest rIP: 0x9112 Hyper-V uhandled wrmsr: 0x4000008c
data 0x47
kvm [8010]: vcpu0, guest rIP: 0x9112 Hyper-V uhandled wrmsr: 0x4000008a
data 0x47
kvm [8010]: vcpu0, guest rIP: 0x9112 Hyper-V uhandled wrmsr: 0x40000088
data 0x47
kvm [8010]: vcpu0, guest rIP: 0x9112 Hyper-V uhandled wrmsr: 0x40000086
data 0x47
program syz-executor2 is using a deprecated SCSI ioctl, please convert it
to SG_IO
sd 0:0:1:0: ioctl_internal_command: ILLEGAL REQUEST asc=0x20 ascq=0x0
program syz-executor2 is using a deprecated SCSI ioctl, please convert it
to SG_IO
sd 0:0:1:0: ioctl_internal_command: ILLEGAL REQUEST asc=0x20 ascq=0x0
netlink: 2 bytes leftover after parsing attributes in process
`syz-executor5'.
program syz-executor2 is using a deprecated SCSI ioctl, please convert it
to SG_IO
sd 0:0:1:0: ioctl_internal_command: ILLEGAL REQUEST asc=0x20 ascq=0x0
netlink: 2 bytes leftover after parsing attributes in process
`syz-executor5'.
program syz-executor2 is using a deprecated SCSI ioctl, please convert it
to SG_IO
sd 0:0:1:0: ioctl_internal_command: ILLEGAL REQUEST asc=0x20 ascq=0x0
kauditd_printk_skb: 264 callbacks suppressed
audit: type=1326 audit(1512291148.643:889): auid=4294967295 uid=0 gid=0
ses=4294967295 subj=kernel pid=8160 comm="syz-executor5"
exe="/root/syz-executor5" sig=0 arch=c000003e syscall=202 compat=0
ip=0x4529d9 code=0x7ffc0000
audit: type=1326 audit(1512291148.643:890): auid=4294967295 uid=0 gid=0
ses=4294967295 subj=kernel pid=8160 comm="syz-executor5"
exe="/root/syz-executor5" sig=0 arch=c000003e syscall=202 compat=0
ip=0x4529d9 code=0x7ffc0000
audit: type=1326 audit(1512291148.643:891): auid=4294967295 uid=0 gid=0
ses=4294967295 subj=kernel pid=8160 comm="syz-executor5"
exe="/root/syz-executor5" sig=0 arch=c000003e syscall=22 compat=0
ip=0x4529d9 code=0x7ffc0000
audit: type=1326 audit(1512291148.650:892): auid=4294967295 uid=0 gid=0
ses=4294967295 subj=kernel pid=8160 comm="syz-executor5"
exe="/root/syz-executor5" sig=0 arch=c000003e syscall=202 compat=0
ip=0x4529d9 code=0x7ffc0000
audit: type=1326 audit(1512291148.650:893): auid=4294967295 uid=0 gid=0
ses=4294967295 subj=kernel pid=8160 comm="syz-executor5"
exe="/root/syz-executor5" sig=0 arch=c000003e syscall=54 compat=0
ip=0x4529d9 code=0x7ffc0000
audit: type=1326 audit(1512291148.650:894): auid=4294967295 uid=0 gid=0
ses=4294967295 subj=kernel pid=8160 comm="syz-executor5"
exe="/root/syz-executor5" sig=0 arch=c000003e syscall=202 compat=0
ip=0x4529d9 code=0x7ffc0000
audit: type=1326 audit(1512291148.650:895): auid=4294967295 uid=0 gid=0
ses=4294967295 subj=kernel pid=8160 comm="syz-executor5"
exe="/root/syz-executor5" sig=0 arch=c000003e syscall=298 compat=0
ip=0x4529d9 code=0x7ffc0000
audit: type=1326 audit(1512291148.672:896): auid=4294967295 uid=0 gid=0
ses=4294967295 subj=kernel pid=8160 comm="syz-executor5"
exe="/root/syz-executor5" sig=0 arch=c000003e syscall=202 compat=0
ip=0x4529d9 code=0x7ffc0000
audit: type=1326 audit(1512291148.705:897): auid=4294967295 uid=0 gid=0
ses=4294967295 subj=kernel pid=8160 comm="syz-executor5"
exe="/root/syz-executor5" sig=0 arch=c000003e syscall=2 compat=0
ip=0x40cd11 code=0x7ffc0000
audit: type=1326 audit(1512291148.705:898): auid=4294967295 uid=0 gid=0
ses=4294967295 subj=kernel pid=8160 comm="syz-executor5"
exe="/root/syz-executor5" sig=0 arch=c000003e syscall=202 compat=0
ip=0x4529d9 code=0x7ffc0000
QAT: Invalid ioctl
QAT: Invalid ioctl


---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzk...@googlegroups.com.
Please credit me with: Reported-by: syzbot <syzk...@googlegroups.com>

syzbot will keep track of this bug report.
Once a fix for this bug is committed, please reply to this email with:
#syz fix: exact-commit-title
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug
report.
Note: all commands must start from beginning of the line in the email body.
config.txt
raw.log

Dmitry Vyukov

Dec 3, 2017, 9:36:55 AM
to syzbot, Greg Kroah-Hartman, Kate Stewart, LKML, Linux-MM, Philippe Ombredanne, syzkall...@googlegroups.com, Thomas Gleixner
On Sun, Dec 3, 2017 at 3:31 PM, syzbot
<bot+e38be687a2450270a3...@syzkaller.appspotmail.com>
wrote:
This error report does not look actionable. Perhaps if the code that
detects it dumped CPU/task stacks, it would be actionable.

Thomas Gleixner

Dec 3, 2017, 9:48:20 AM
to Dmitry Vyukov, syzbot, Greg Kroah-Hartman, Kate Stewart, LKML, Linux-MM, Philippe Ombredanne, syzkall...@googlegroups.com
That might be related to the RCU stall issue we are chasing, where a timer
does not fire for as-yet-unknown reasons. We have a reproducer now and
hopefully will have a solution in the next few days.

Thanks,

tglx

Tetsuo Handa

Dec 4, 2017, 6:08:29 AM
to Thomas Gleixner, Dmitry Vyukov, syzbot, Greg Kroah-Hartman, Kate Stewart, LKML, Linux-MM, Philippe Ombredanne, syzkall...@googlegroups.com
Can you tell me where "the RCU stall issue" is being discussed? According
to my stress tests, wb_workfn being in-flight while other work items remain
pending is a possible sign of an OOM lockup in which wb_workfn is unable to
invoke the OOM killer (due to a GFP_NOFS allocation request, like the
example shown below).

[ 162.810797] kworker/u16:27: page allocation stalls for 10001ms, order:0, mode:0x1400040(GFP_NOFS), nodemask=(null)
[ 162.810805] kworker/u16:27 cpuset=/ mems_allowed=0
[ 162.810812] CPU: 2 PID: 354 Comm: kworker/u16:27 Not tainted 4.12.0-next-20170713+ #629
[ 162.810813] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 162.810819] Workqueue: writeback wb_workfn (flush-8:0)
[ 162.810822] Call Trace:
[ 162.810829] dump_stack+0x67/0x9e
[ 162.810835] warn_alloc+0x10f/0x1b0
[ 162.810843] ? wake_all_kswapds+0x56/0x96
[ 162.810850] __alloc_pages_nodemask+0xabd/0xeb0
[ 162.810873] alloc_pages_current+0x65/0xb0
[ 162.810879] xfs_buf_allocate_memory+0x15b/0x298
[ 162.810886] xfs_buf_get_map+0xf4/0x150
[ 162.810893] xfs_buf_read_map+0x29/0xd0
[ 162.810900] xfs_trans_read_buf_map+0x9a/0x1a0
[ 162.810906] xfs_btree_read_buf_block.constprop.35+0x73/0xc0
[ 162.810915] xfs_btree_lookup_get_block+0x83/0x160
[ 162.810922] xfs_btree_lookup+0xcb/0x3b0
[ 162.810930] ? xfs_allocbt_init_cursor+0x3c/0xe0
[ 162.810936] xfs_alloc_ag_vextent_near+0x216/0x840
[ 162.810949] xfs_alloc_ag_vextent+0x137/0x150
[ 162.810952] xfs_alloc_vextent+0x2ff/0x370
[ 162.810958] xfs_bmap_btalloc+0x211/0x760
[ 162.810980] xfs_bmap_alloc+0x9/0x10
[ 162.810983] xfs_bmapi_write+0x618/0xc00
[ 162.811015] xfs_iomap_write_allocate+0x18e/0x390
[ 162.811034] xfs_map_blocks+0x160/0x170
[ 162.811042] xfs_do_writepage+0x1b9/0x6b0
[ 162.811056] write_cache_pages+0x1f6/0x490
[ 162.811061] ? xfs_aops_discard_page+0x130/0x130
[ 162.811079] xfs_vm_writepages+0x66/0xa0
[ 162.811088] do_writepages+0x17/0x80
[ 162.811092] __writeback_single_inode+0x33/0x170
[ 162.811097] writeback_sb_inodes+0x2cb/0x5e0
[ 162.811116] __writeback_inodes_wb+0x87/0xc0
[ 162.811122] wb_writeback+0x1d9/0x210
[ 162.811135] wb_workfn+0x1a2/0x260
[ 162.811148] process_one_work+0x1d0/0x3e0
[ 162.811150] ? process_one_work+0x16a/0x3e0
[ 162.811159] worker_thread+0x48/0x3c0
[ 162.811169] kthread+0x10d/0x140
[ 162.811170] ? process_one_work+0x3e0/0x3e0
[ 162.811173] ? kthread_create_on_node+0x60/0x60
[ 162.811179] ret_from_fork+0x27/0x40

syzbot

Dec 19, 2017, 7:25:03 AM
to dvy...@google.com, gre...@linuxfoundation.org, kste...@linuxfoundation.org, linux-...@vger.kernel.org, linu...@kvack.org, penguin...@i-love.sakura.ne.jp, pombr...@nexb.com, syzkall...@googlegroups.com, tg...@linutronix.de
syzkaller has found reproducer for the following crash on
f3b5ad89de16f5d42e8ad36fbdf85f705c1ae051
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
compiler: gcc (GCC) 7.1.1 20170620
.config is attached
Raw console output is attached.
C reproducer is attached
syzkaller reproducer is attached. See https://goo.gl/kgGztJ
for information about syzkaller reproducers


BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 37s!
BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=-20 stuck for 32s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
pending: cache_reap
workqueue events_power_efficient: flags=0x80
pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
pending: neigh_periodic_work, do_cache_clean
workqueue mm_percpu_wq: flags=0x8
pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
pending: vmstat_update
workqueue kblockd: flags=0x18
pwq 3: cpus=1 node=0 flags=0x0 nice=-20 active=1/256
pending: blk_timeout_work

config.txt
raw.log
repro.txt
repro.c

Tetsuo Handa

Dec 19, 2017, 9:27:30 AM
to bot+e38be687a2450270a3...@syzkaller.appspotmail.com, syzkall...@googlegroups.com, dvy...@google.com, gre...@linuxfoundation.org, kste...@linuxfoundation.org, linux-...@vger.kernel.org, linu...@kvack.org, pombr...@nexb.com, tg...@linutronix.de
syzbot wrote:
>
> syzkaller has found reproducer for the following crash on
> f3b5ad89de16f5d42e8ad36fbdf85f705c1ae051

"BUG: workqueue lockup" is not a crash.
You gave up too early. There is no hint for understanding what was going on.
While we can observe "BUG: workqueue lockup" under memory pressure, there is
no hint like SysRq-t and SysRq-m. Thus, I can't tell something is wrong.

At least you need to confirm that lockup lasts for a few minutes. Otherwise,
this might be just overstressing. (According to repro.c , 12 threads are
created and soon SEGV follows? According to above message, only 2 CPUs?
Triggering SEGV suggests memory was low due to saving coredump?)

Dmitry Vyukov

Dec 19, 2017, 9:41:05 AM
to Tetsuo Handa, syzbot, syzkall...@googlegroups.com, Greg Kroah-Hartman, Kate Stewart, LKML, Linux-MM, Philippe Ombredanne, Thomas Gleixner
On Tue, Dec 19, 2017 at 3:27 PM, Tetsuo Handa
<penguin...@i-love.sakura.ne.jp> wrote:
> syzbot wrote:
>>
>> syzkaller has found reproducer for the following crash on
>> f3b5ad89de16f5d42e8ad36fbdf85f705c1ae051
>
> "BUG: workqueue lockup" is not a crash.

Hi Tetsuo,

What is the proper name for all of these collectively?
Do you know how to send them programmatically? I tried to find a way
several times, but failed. Articles that I've found talk about
pressing some keys that don't translate directly to us-ascii.

But you can also run the reproducer. No report can possibly provide
all the potentially useful information; sometimes debugging boils down to
manually adding printfs. That's why syzbot aims at providing a
reproducer as the ultimate source of details. Also since a developer
needs to test a proposed fix, it's easier to start with the reproducer
right away.


> At least you need to confirm that lockup lasts for a few minutes. Otherwise,

Is it possible to increase the timeout? How? We could bump it up to 2 minutes.

Dmitry Vyukov

Dec 19, 2017, 11:38:10 AM
to Tetsuo Handa, syzbot, syzkall...@googlegroups.com, Greg Kroah-Hartman, Kate Stewart, LKML, Linux-MM, Philippe Ombredanne, Thomas Gleixner
On second thought, some oopses automatically dump locks/tasks. Should
we do the same for this oops?
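
For comparison (just a sketch of an existing knob, not something syzbot
sets today), the soft lockup detector can already be told to dump
backtraces of all CPUs when it fires:

# echo 1 > /proc/sys/kernel/softlockup_all_cpu_backtrace

(or softlockup_all_cpu_backtrace=1 on the kernel command line), so a
similar option for the workqueue lockup detector would fit the existing
pattern.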

Tetsuo Handa

Dec 20, 2017, 5:55:17 AM
to dvy...@google.com, bot+e38be687a2450270a3...@syzkaller.appspotmail.com, syzkall...@googlegroups.com, gre...@linuxfoundation.org, kste...@linuxfoundation.org, linux-...@vger.kernel.org, linu...@kvack.org, pombr...@nexb.com, tg...@linutronix.de
Dmitry Vyukov wrote:
> On Tue, Dec 19, 2017 at 3:27 PM, Tetsuo Handa
> <penguin...@i-love.sakura.ne.jp> wrote:
> > syzbot wrote:
> >>
> >> syzkaller has found reproducer for the following crash on
> >> f3b5ad89de16f5d42e8ad36fbdf85f705c1ae051
> >
> > "BUG: workqueue lockup" is not a crash.
>
> Hi Tetsuo,
>
> What is the proper name for all of these collectively?

I think that things which lead to a kernel panic when
/proc/sys/kernel/panic_on_oops is set to 1 are called an "oops" (or a
"kerneloops").

Speaking of "BUG: workqueue lockup", this is not an "oops". This message was
added by 82607adcf9cdf40f ("workqueue: implement lockup detector"), and it
does not always indicate a fatal problem. It can be printed when the system
is really out of CPU and memory. As far as I tested, I think the workqueue
was not able to run on a specific CPU due to a soft lockup bug.
# echo t > /proc/sysrq-trigger
# echo m > /proc/sysrq-trigger

>
> But you can also run the reproducer. No report can possible provide
> all possible useful information, sometimes debugging boils down to
> manually adding printfs. That's why syzbot aims at providing a
> reproducer as the ultimate source of details. Also since a developer
> needs to test a proposed fix, it's easier to start with the reproducer
> right away.

I don't have information about how to run the reproducer (e.g. how many
CPUs, how much memory, and what network configuration are needed).

Also, please explain how to interpret the raw.log file. The raw.log in
94eb2c03c9bc75...@google.com had a lot of code output and kernel
messages but did not contain a "BUG: workqueue lockup" message. On the other
hand, the raw.log in 001a113f711a52...@google.com has only kernel
messages and does contain the "BUG: workqueue lockup" message. Why are they
significantly different?

Also, can you add a timestamp to all messages?
When each message was printed is a clue for understanding the relationships.

>
>
> > At least you need to confirm that lockup lasts for a few minutes. Otherwise,
>
> Is it possible to increase the timeout? How? We could bump it up to 2 minutes.

# echo 120 > /sys/module/workqueue/parameters/watchdog_thresh
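
(The current value can be checked with
"# cat /sys/module/workqueue/parameters/watchdog_thresh";
the default is 30 seconds.)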

But generally, reporting multiple times rather than only once gives me a
better clue, for the former would tell me whether the situation was changing.

Can you try not to give up as soon as "BUG: workqueue lockup" is printed
for the first time?

Dmitry Vyukov

Dec 21, 2017, 5:19:28 AM
to Tetsuo Handa, syzbot, syzkall...@googlegroups.com, Greg Kroah-Hartman, Kate Stewart, LKML, Linux-MM, Philippe Ombredanne, Thomas Gleixner
On Wed, Dec 20, 2017 at 11:55 AM, Tetsuo Handa
<penguin...@i-love.sakura.ne.jp> wrote:
> Dmitry Vyukov wrote:
>> On Tue, Dec 19, 2017 at 3:27 PM, Tetsuo Handa
>> <penguin...@i-love.sakura.ne.jp> wrote:
>> > syzbot wrote:
>> >>
>> >> syzkaller has found reproducer for the following crash on
>> >> f3b5ad89de16f5d42e8ad36fbdf85f705c1ae051
>> >
>> > "BUG: workqueue lockup" is not a crash.
>>
>> Hi Tetsuo,
>>
>> What is the proper name for all of these collectively?
>
> I think that things which lead to kernel panic when /proc/sys/kernel/panic_on_oops
> was set to 1 are called an "oops" (or a "kerneloops").
>
> Speak of "BUG: workqueue lockup", this is not an "oops". This message was
> added by 82607adcf9cdf40f ("workqueue: implement lockup detector"), and
> this message does not always indicate a fatal problem. This message can be
> printed when the system is really out of CPU and memory. As far as I tested,
> I think that workqueue was not able to run on specific CPU due to a soft
> lockup bug.


There are also warnings, which don't panic normally unless
panic_on_warn is set. There are also cases where we suddenly lose a
machine and have no idea what happened to it. And also cases where we
are kind of connected, and nothing bad is printed on the console, but
the machine is still inoperable.
The only collective name I can think of is "bug". We could change the
wording to "bug". Otherwise, since there are multiple names, I don't
think it's worth spending more time on this.

Dmitry Vyukov

Dec 21, 2017, 5:23:06 AM
to Tetsuo Handa, syzbot, syzkall...@googlegroups.com, Greg Kroah-Hartman, Kate Stewart, LKML, Linux-MM, Philippe Ombredanne, Thomas Gleixner
On Wed, Dec 20, 2017 at 11:55 AM, Tetsuo Handa
<penguin...@i-love.sakura.ne.jp> wrote:
This requires a working ssh connection, but we routinely deal with
half-dead kernels. I think that sysrq over the console is as reliable as
we can get in this context. But I don't know how to send them that way.

But thinking more about this, I am leaning towards the view that the
kernel just needs to do the right thing and print that info itself.
In lots of cases we get a panic, and as far as I understand the kernel
won't react to sysrq in that state. The console is still unreliable too.
If a message is not useful, the right direction is to make it useful.

Dmitry Vyukov

Dec 21, 2017, 6:04:31 AM
to Tetsuo Handa, syzbot, syzkall...@googlegroups.com, Greg Kroah-Hartman, Kate Stewart, LKML, Linux-MM, Philippe Ombredanne, Thomas Gleixner
On Wed, Dec 20, 2017 at 11:55 AM, Tetsuo Handa
<penguin...@i-love.sakura.ne.jp> wrote:
Usually all of that is irrelevant and these reproduce well on any machine.
FWIW, there were 2 CPUs and 2 GB of memory. Network: whatever GCE
provides as the default network.


> Also, please explain how to interpret raw.log file. The raw.log in
> 94eb2c03c9bc75...@google.com had a lot of code output and kernel
> messages but did not contain "BUG: workqueue lockup" message. On the other
> hand, the raw.log in 001a113f711a52...@google.com has only kernel
> messages and contains "BUG: workqueue lockup" message. Why they are
> significantly different?


The first raw.log does contain "BUG: workqueue lockup"; I see it right there:

[ 120.799119] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0
nice=0 stuck for 48s!
[ 120.807313] BUG: workqueue lockup - pool cpus=0-1 flags=0x4 nice=0
stuck for 47s!
[ 120.815024] Showing busy workqueues and worker pools:
[ 120.820369] workqueue events: flags=0x0
[ 120.824536] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=4/256
[ 120.830803] pending: perf_sched_delayed, vmstat_shepherd,
jump_label_update_timeout, cache_reap
[ 120.840149] workqueue events_power_efficient: flags=0x80
[ 120.845651] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=4/256
[ 120.851822] pending: neigh_periodic_work, neigh_periodic_work,
do_cache_clean, reg_check_chans_work
[ 120.861447] workqueue mm_percpu_wq: flags=0x8
[ 120.865947] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 120.872082] pending: vmstat_update
[ 120.875994] workqueue writeback: flags=0x4e
[ 120.880416] pwq 4: cpus=0-1 flags=0x4 nice=0 active=1/256
[ 120.886164] in-flight: 3401:wb_workfn
[ 120.890358] workqueue kblockd: flags=0x18

The difference is caused by the fact that the first one was obtained
from a fuzzing session where the fuzzer executed lots of random programs,
while the second one was an attempt to localize a reproducer, so the
system ran programs one-by-one on freshly booted machines.



> Also, can you add timestamp to all messages?
> When each message was printed is a clue for understanding relationship.

There are timestamps. Each program is prefixed with a timestamp:

2017/12/03 08:51:30 executing program 6:

These allow tying kernel time to real time:

[ 71.240837] QAT: Invalid ioctl
2017/12/03 08:51:30 executing program 3:



>> > At least you need to confirm that lockup lasts for a few minutes. Otherwise,
>>
>> Is it possible to increase the timeout? How? We could bump it up to 2 minutes.
>
> # echo 120 > /sys/module/workqueue/parameters/watchdog_thresh
>
> But generally, reporting multiple times rather than only once gives me
> better clue, for the former would tell me whether situation was changing.
>
> Can you try not to give up as soon as "BUG: workqueue lockup" was printed
> for the first time?


I've bumped the timeout to 120 seconds with the workqueue.watchdog_thresh=120
command line arg. Let's see if that still leaves any false positives. I
think 2 minutes should be enough; a CPU stalled for 2+ minutes
suggests something to fix anyway (even if just slowness somewhere). And
in the end this wasn't a false positive either, right?
Not giving up after an oops message will be hard and problematic for
several reasons.

Tetsuo Handa

Dec 21, 2017, 8:07:47 AM
to dvy...@google.com, bot+e38be687a2450270a3...@syzkaller.appspotmail.com, syzkall...@googlegroups.com, gre...@linuxfoundation.org, kste...@linuxfoundation.org, linux-...@vger.kernel.org, linu...@kvack.org, pombr...@nexb.com, tg...@linutronix.de
Dmitry Vyukov wrote:
> On Wed, Dec 20, 2017 at 11:55 AM, Tetsuo Handa
> <penguin...@i-love.sakura.ne.jp> wrote:
> > Dmitry Vyukov wrote:
> >> On Tue, Dec 19, 2017 at 3:27 PM, Tetsuo Handa
> >> <penguin...@i-love.sakura.ne.jp> wrote:
> >> > syzbot wrote:
> >> >>
> >> >> syzkaller has found reproducer for the following crash on
> >> >> f3b5ad89de16f5d42e8ad36fbdf85f705c1ae051
> >> >
> >> > "BUG: workqueue lockup" is not a crash.
> >>
> >> Hi Tetsuo,
> >>
> >> What is the proper name for all of these collectively?
> >
> > I think that things which lead to kernel panic when /proc/sys/kernel/panic_on_oops
> > was set to 1 are called an "oops" (or a "kerneloops").
> >
> > Speak of "BUG: workqueue lockup", this is not an "oops". This message was
> > added by 82607adcf9cdf40f ("workqueue: implement lockup detector"), and
> > this message does not always indicate a fatal problem. This message can be
> > printed when the system is really out of CPU and memory. As far as I tested,
> > I think that workqueue was not able to run on specific CPU due to a soft
> > lockup bug.
>
> There are also warnings which don't panic normally, unless
> panic_on_warn is set. There are also cases when we suddenly lost a
> machine and have no idea what happened with it. And also cases when we
> are kind-a connected, and nothing bad is printed on console, but it's
> still un-operable.

Configuring netconsole might be helpful. I use udplogger at
https://osdn.net/projects/akari/scm/svn/tree/head/branches/udplogger/
in order to collect all messages (not only kernel messages but also
any text messages which can be sent as UDP packets) with a timestamp
added to each line.

An example of the per-line timestamps is
http://I-love.SAKURA.ne.jp/tmp/20171018-deflate.log.xz .

You can combine kernel messages from netconsole and output from a shell
session using bash's

$ (command1; command2; command3) > /dev/udp/$remote_ip/$remote_port

syntax.
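
For reference, a minimal netconsole setup looks something like this (the
addresses, ports, interface and MAC below are only placeholders for your
environment; 192.168.1.10:6666 stands for the host running udplogger):

# modprobe netconsole netconsole=6665@192.168.1.5/eth0,6666@192.168.1.10/00:aa:bb:cc:dd:ee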

> The only collective name I can think of is bug. We could change it to
> bug. Otherwise since there are multiple names, I don't think it's
> worth spending more time on this.

What I care about is whether the report is useful.

>
> >> >
> >> > You gave up too early. There is no hint for understanding what was going on.
> >> > While we can observe "BUG: workqueue lockup" under memory pressure, there is
> >> > no hint like SysRq-t and SysRq-m. Thus, I can't tell something is wrong.
> >>
> >> Do you know how to send them programmatically? I tried to find a way
> >> several times, but failed. Articles that I've found talk about
> >> pressing some keys that don't translate directly to us-ascii.
> >
> > # echo t > /proc/sysrq-trigger
> > # echo m > /proc/sysrq-trigger
>
>
> This requires working ssh connection, but we routinely deal with
> half-dead kernels. I think that sysrq over console is as reliable as
> we can get in this context. But I don't know how to send them.

I don't understand your question. If the machine is running in a
virtualized environment, doesn't the hypervisor provide a means to send
SysRq commands to a guest remotely (e.g. "virsh send-key")?

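For example, with libvirt, something like the following should inject
SysRq-T into a guest (the domain name "guest1" is only a placeholder):

# virsh send-key guest1 KEY_LEFTALT KEY_SYSRQ KEY_T
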
If no means is available, running

----------
#!/bin/sh

while :
do
        echo t > /proc/sysrq-trigger
        echo m > /proc/sysrq-trigger
        sleep 60
done
----------

in the background might be used.


>
> But thinking more about this, I am leaning towards the direction that
> kernel just need to do the right thing and print that info.
> In lots of cases we get a panic and as far as I understand kernel
> won't react on sysrq in that state. Console is still unreliable too.
> If a message is not useful, the right direction is to make it useful.
>

Then, configure kdump and analyze the vmcore. A kernel panic message
alone is not so helpful. You can feed commands to the crash utility from
stdin and save stdout to a file. The resulting file will provide
more information than SysRq-t + SysRq-m (apart from the lack of ability
to understand whether the situation has changed over time).
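
For example (file names are only placeholders; vmlinux must have debug
info and vmcore is the dump saved by kdump):

# crash vmlinux vmcore < commands.txt > result.txt

where commands.txt contains crash commands such as "bt -a", "ps",
"kmem -i" and "log", one per line.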

> >>
> >> But you can also run the reproducer. No report can possible provide
> >> all possible useful information, sometimes debugging boils down to
> >> manually adding printfs. That's why syzbot aims at providing a
> >> reproducer as the ultimate source of details. Also since a developer
> >> needs to test a proposed fix, it's easier to start with the reproducer
> >> right away.
> >
> > I don't have information about how to run the reproducer (e.g. how many
> > CPUs, how much memory, what network configuration is needed).
>
> Usually all of that is irrelevant and these reproduce well on any machine.
> FWIW, there were 2 CPUs and 2 GBs of memory. Network -- whatever GCE
> provides as default network.

The reproducer contained network addresses.
If the bug depends on the network, how to configure the network is important.
Where?

I'm talking about https://marc.info/?l=linux-mm&m=151231146619948&q=p4
at http://lkml.kernel.org/r/94eb2c03c9bc75...@google.com .

>
> The difference is cause by the fact that the first one was obtained
> from fuzzing session when fuzzer executed lots of random programs,
> while the second one was an attempt to localize a reproducer, so the
> system run programs one-by-one on freshly booted machines.
>

I see. But the context is too limited to know that.

>
>
> > Also, can you add timestamp to all messages?
> > When each message was printed is a clue for understanding relationship.
>
> There are timestamps. each program is prefixed with timestamps:
>
> 2017/12/03 08:51:30 executing program 6:
>
> these things allow to tie kernel and real time:
>
> [ 71.240837] QAT: Invalid ioctl
> 2017/12/03 08:51:30 executing program 3:
>

What I want is something like

timestamp kernel message 1
timestamp kernel message 2
timestamp kernel message 3
timestamp shell session message 1
timestamp kernel message 4
timestamp kernel message 5
timestamp shell session message 2
timestamp shell session message 3
timestamp kernel message 6
timestamp kernel message 7

which can be done using udplogger above.
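
For example (the address and port are placeholders), the SysRq loop shown
earlier can send a marker line over the same UDP channel before each
round, so that the receiver interleaves shell-side markers with kernel
messages:

----------
#!/bin/bash

while :
do
        echo "=== sending SysRq-t + SysRq-m ===" > /dev/udp/192.168.1.10/6666
        echo t > /proc/sysrq-trigger
        echo m > /proc/sysrq-trigger
        sleep 60
done
----------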

>
>
> >> > At least you need to confirm that lockup lasts for a few minutes. Otherwise,
> >>
> >> Is it possible to increase the timeout? How? We could bump it up to 2 minutes.
> >
> > # echo 120 > /sys/module/workqueue/parameters/watchdog_thresh
> >
> > But generally, reporting multiple times rather than only once gives me
> > better clue, for the former would tell me whether situation was changing.
> >
> > Can you try not to give up as soon as "BUG: workqueue lockup" was printed
> > for the first time?
>
>
> I've bumped timeout to 120 seconds with workqueue.watchdog_thresh=120
> command line arg. Let's see if it still leaves any false positives, I
> think 2 minutes should be enough, a CPU stalled for 2+ minutes
> suggests something to fix anyway(even if just slowness somewhere). And
> in the end this wasn't a false positive either, right?

Regarding this bug, the report should include the soft lockups rather than
the workqueue lockups, for the workqueue was not able to run for a long
time due to a soft lockup in progress.

> Not giving up after an oops message will be hard and problematic for
> several reasons.
>

But a report from which we cannot understand what was happening is not actionable.
Again, "BUG: workqueue lockup" is not an "oops".

Dmitry Vyukov

Dec 28, 2017, 8:43:28 AM
to Tetsuo Handa, syzbot, syzkall...@googlegroups.com, Greg Kroah-Hartman, Kate Stewart, LKML, Linux-MM, Philippe Ombredanne, Thomas Gleixner
On Thu, Dec 21, 2017 at 2:07 PM, Tetsuo Handa
syzkaller already sends everything over the network to a reliable host. So
this part is already working.



>> The only collective name I can think of is bug. We could change it to
>> bug. Otherwise since there are multiple names, I don't think it's
>> worth spending more time on this.
>
> What I care is whether the report is useful.
>
>>
>> >> >
>> >> > You gave up too early. There is no hint for understanding what was going on.
>> >> > While we can observe "BUG: workqueue lockup" under memory pressure, there is
>> >> > no hint like SysRq-t and SysRq-m. Thus, I can't tell something is wrong.
>> >>
>> >> Do you know how to send them programmatically? I tried to find a way
>> >> several times, but failed. Articles that I've found talk about
>> >> pressing some keys that don't translate directly to us-ascii.
>> >
>> > # echo t > /proc/sysrq-trigger
>> > # echo m > /proc/sysrq-trigger
>>
>>
>> This requires working ssh connection, but we routinely deal with
>> half-dead kernels. I think that sysrq over console is as reliable as
>> we can get in this context. But I don't know how to send them.
>
> I can't understand your question. If the machine is running in a
> virtualized environment, doesn't hypervisor provide a mean to send
> SysRq commands to a guest remotely (e.g. "virsh send-keys sysrq") ?

These particular machines were GCE instances. I can't find any info
about special GCE capabilities to send sysrqs.


> If no means available, running
>
> ----------
> #/bin/sh
>
> while :
> do
> echo t > /proc/sysrq-trigger
> echo m > /proc/sysrq-trigger
> sleep 60
> done
> ----------
>
> in the background might be used.

This has a good chance of missing the interesting stacks. Thinking about
this more, I think the kernel should dump that info on bugs. The current
"BUG: workqueue lockup" report is not actionable, and that is not directly
related to syzbot; it's related to the kernel.


>> But thinking more about this, I am leaning towards the direction that
>> kernel just need to do the right thing and print that info.
>> In lots of cases we get a panic and as far as I understand kernel
>> won't react on sysrq in that state. Console is still unreliable too.
>> If a message is not useful, the right direction is to make it useful.
>>
>
> Then, configure kdump and analyze the vmcore. Kernel panic message
> alone is not so helpful. You can feed commands to crash utility from
> stdin and save stdout to a file. Then, the result file will provide
> more information than SysRq-t + SysRq-m (apart from lack of ability to
> understand whether situation has changed over time).

I've filed https://github.com/google/syzkaller/issues/491 for kdump
cores. But there is a lot to learn. And this also needs to be done not
once by an intelligent human, but programmed to work fully
automatically, which is usually much harder to do.
The general idea is that the reproducer is the ultimate source of
details. kdump may well not be helpful either. Lots of people won't
look at dumps at all for various reasons. Sometimes you need to add
additional printf's, re-run, and then repeat this multiple times. I
don't think there is a magical piece of information that will shed light
on just any kernel issue.


>> >> But you can also run the reproducer. No report can possible provide
>> >> all possible useful information, sometimes debugging boils down to
>> >> manually adding printfs. That's why syzbot aims at providing a
>> >> reproducer as the ultimate source of details. Also since a developer
>> >> needs to test a proposed fix, it's easier to start with the reproducer
>> >> right away.
>> >
>> > I don't have information about how to run the reproducer (e.g. how many
>> > CPUs, how much memory, what network configuration is needed).
>>
>> Usually all of that is irrelevant and these reproduce well on any machine.
>> FWIW, there were 2 CPUs and 2 GBs of memory. Network -- whatever GCE
>> provides as default network.
>
> The reproducer contained network addresses.
> If the bug depends on network, how to configure network is important.

Do you mean the getsockopt$inet_sctp6_SCTP_GET_LOCAL_ADDRS call? But it
only obtains addresses, and I think it fails because it's called on a
local file. Generally, the network communication of these programs is
self-contained. If they use the network, they bring up their own interfaces.
There are lots of bits to full reproducibility. For example, you would
also need to use GCE VMs. As I said, in 95% of cases these are
reproducible without any special measures (the .config obviously matters,
but it's supplied).
Interesting. Looks like an LKML archive bug; the file is truncated
halfway. You can see the full raw.log here:
https://groups.google.com/d/msg/syzkaller-bugs/vwcINLkXTVQ/fuzYSNeXAwAJ

I've tried to find the LKML and kernel bugzilla admins, but can't find
any real people. If you know how to contact them, we can talk to them.



>> The difference is cause by the fact that the first one was obtained
>> from fuzzing session when fuzzer executed lots of random programs,
>> while the second one was an attempt to localize a reproducer, so the
>> system run programs one-by-one on freshly booted machines.
>>
>
> I see. But context is too limited to know that.

Yes. But there is also a problem of too much context. We have a hard
time making some people read even a minimal amount of concentrated
information. Having a 100-page [outdated] manual won't be helpful
either, and as usually happens, such manuals tend to contain
everything but the bit of information you are actually looking for.
That's why I am answering questions.

manudu...@gmail.com

Apr 9, 2018, 11:06:26 PM
to syzkaller-bugs
I'm hitting this bug three or four times a day. The computer freezes, only the mouse responds, and it takes roughly five minutes to become responsive again.



[16846.160193] workqueue events_power_efficient: flags=0x80
[16846.160195]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=4/256
[16846.160198]     pending: gc_worker [nf_conntrack], neigh_periodic_work, neigh_periodic_work, check_lifetime
[16846.160223] workqueue events_freezable_power_: flags=0x84
[16846.160224]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[16846.160227]     pending: disk_events_workfn
[16846.160236] workqueue mm_percpu_wq: flags=0x8
[16846.160238]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[16846.160240]     pending: drain_local_pages_wq BAR(1662)
[16876.228870] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 146s!
[16876.228879] Showing busy workqueues and worker pools:
[16876.228882] workqueue events: flags=0x0
[16876.228883]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=256/256
[16876.228885]     pending: amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], cache_reap, amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu]
[16876.229295] , amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu]
[16876.229660] , amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], vmstat_shepherd, amd_sched_job_finish [amdgpu]
[16876.230026] , amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu]
[16876.230390] , amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu]
[16876.230753] , amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu]
[16876.231116] , amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu]
[16876.231479] , amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu]
[16876.231842] , amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu]
[16876.232204] , amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu]
[16876.232489]     delayed: amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu]
[16876.232868] , amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu]
[16876.233233] , amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu]
[16876.233597] , amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], amd_sched_job_finish [amdgpu], ...
[16876.233963-16876.241675] (the same "amd_sched_job_finish [amdgpu]" entry repeats several hundred more times; duplicate log lines elided)
[16876.241699] workqueue events_power_efficient: flags=0x80
[16876.241701]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=4/256
[16876.241703]     pending: gc_worker [nf_conntrack], neigh_periodic_work, neigh_periodic_work, check_lifetime
[16876.241731] workqueue events_freezable_power_: flags=0x84
[16876.241732]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[16876.241734]     pending: disk_events_workfn
[16876.241747] workqueue mm_percpu_wq: flags=0x8
[16876.241748]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[16876.241750]     pending: drain_local_pages_wq BAR(1662)

Dmitry Vyukov

unread,
Apr 10, 2018, 4:58:19 AM4/10/18
to manudu...@gmail.com, syzkaller-bugs
On Tue, Apr 10, 2018 at 5:06 AM, <manudu...@gmail.com> wrote:
> I'm hitting this bug three or four times a day. The computer will freeze
> (only the mouse will respond), and it will take roughly five minutes to become
> responsive again.

Hi,

Your issue looks unrelated to this bug; it appears to be related to the amdgpu module.
I don't see an amd_sched_job_finish function in the kernel. If it's a
proprietary driver, you will want to contact the driver's vendor.

Eric Biggers

unread,
May 12, 2018, 5:50:18 PM5/12/18
to syzbot, dvy...@google.com, gre...@linuxfoundation.org, kste...@linuxfoundation.org, linux-...@vger.kernel.org, linu...@kvack.org, penguin...@i-love.sakura.ne.jp, pombr...@nexb.com, syzkall...@googlegroups.com, tg...@linutronix.de
The bug that this reproducer reproduces was fixed a while ago by commit
966031f340185e, so I'm marking this bug report fixed by it:

#syz fix: n_tty: fix EXTPROC vs ICANON interaction with TIOCINQ (aka FIONREAD)

Note that the error message was not always "BUG: workqueue lockup"; it was
sometimes something like "watchdog: BUG: soft lockup - CPU#5 stuck for 22s!".

syzbot is still hitting the "BUG: workqueue lockup" error sometimes, but it must
be for other reasons. None of those currently has a reproducer.

- Eric

Tetsuo Handa

unread,
May 12, 2018, 10:06:24 PM5/12/18
to ebig...@gmail.com, bot+e38be687a2450270a3...@syzkaller.appspotmail.com, pe...@hurleysoftware.com, dvy...@google.com, gre...@linuxfoundation.org, kste...@linuxfoundation.org, linux-...@vger.kernel.org, linu...@kvack.org, pombr...@nexb.com, syzkall...@googlegroups.com, tg...@linutronix.de
Eric Biggers wrote:
> The bug that this reproducer reproduces was fixed a while ago by commit
> 966031f340185e, so I'm marking this bug report fixed by it:
>
> #syz fix: n_tty: fix EXTPROC vs ICANON interaction with TIOCINQ (aka FIONREAD)

Nope. Commit 966031f340185edd ("n_tty: fix EXTPROC vs ICANON interaction with
TIOCINQ (aka FIONREAD)") is dated "Wed Dec 20 17:57:06 2017 -0800", but the last
occurrence on linux.git (on commit 008464a9360e31b1 ("Merge branch 'for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid")) was only a few days ago
("Wed May 9 10:49:52 2018 -1000").

>
> Note that the error message was not always "BUG: workqueue lockup"; it was
> sometimes something like "watchdog: BUG: soft lockup - CPU#5 stuck for 22s!".
>
> syzbot is still hitting the "BUG: workqueue lockup" error sometimes, but it must
> be for other reasons. None of those currently has a reproducer.

The last occurrence on linux.git is considered a duplicate of

[upstream] INFO: rcu detected stall in n_tty_receive_char_special
https://syzkaller.appspot.com/bug?id=3d7481a346958d9469bebbeb0537d5f056bdd6e8

for which we already have a reproducer at
https://groups.google.com/d/msg/syzkaller-bugs/O4DbPiJZFBY/YCVPocx3AgAJ
and debug output at
https://groups.google.com/d/msg/syzkaller-bugs/O4DbPiJZFBY/TxQ7WS5ZAwAJ .

We are currently waiting for comments from Peter Hurley, who added that code.

Eric Biggers

unread,
May 12, 2018, 11:30:17 PM5/12/18
to Tetsuo Handa, bot+e38be687a2450270a3...@syzkaller.appspotmail.com, pe...@hurleysoftware.com, dvy...@google.com, gre...@linuxfoundation.org, kste...@linuxfoundation.org, linux-...@vger.kernel.org, linu...@kvack.org, pombr...@nexb.com, syzkall...@googlegroups.com, tg...@linutronix.de
Hi Tetsuo,
Actually I did verify that the C reproducer is fixed by the commit I said, and I
also simplified the reproducer and turned it into an LTP test
(http://lists.linux.it/pipermail/ltp/2018-May/008071.html). Like I said, syzbot
is still occasionally hitting the same "BUG: workqueue lockup" error, but
apparently for other reasons. The one on 008464a9360e31b even looks like it's
in the TTY layer too, and it very well could be a very similar bug, but based on
what I observed it's not the same bug that syzbot reproduced on f3b5ad89de16f5d.
Generally it's best to close syzbot bug reports once the original cause is
fixed, so that syzbot can continue to report other bugs with the same signature.
Otherwise they sit on the syzbot dashboard where few people are looking at them.
Though of course, if you are up to it, you're certainly free to look into any of
the crashes already there even before a new bug report gets created.

Note also that a "workqueue lockup" can be caused by almost anything in the
kernel, I think. This one for example is probably in the sound subsystem:
https://syzkaller.appspot.com/text?tag=CrashReport&x=1767232b800000

Thanks!

Eric

Tetsuo Handa

unread,
May 13, 2018, 10:29:50 AM5/13/18
to ebig...@gmail.com, bot+e38be687a2450270a3...@syzkaller.appspotmail.com, pe...@hurleysoftware.com, dvy...@google.com, gre...@linuxfoundation.org, kste...@linuxfoundation.org, linux-...@vger.kernel.org, linu...@kvack.org, pombr...@nexb.com, syzkall...@googlegroups.com, tg...@linutronix.de
Eric Biggers wrote:
> Generally it's best to close syzbot bug reports once the original cause is
> fixed, so that syzbot can continue to report other bugs with the same signature.

That's difficult to judge. Closing as soon as the original cause is fixed allows
syzbot to report a different reproducer for a different bug. But at the same time,
different or similar bugs which were reported in that report (or in comments in the
discussion for that report) become almost invisible to users (because users are
unlikely to check the other reports attached to already-fixed bugs).

An example is

general protection fault in kernfs_kill_sb (2)
https://syzkaller.appspot.com/bug?id=903af3e08fc7ec60e57d9c9b93b035f4fb038d9a

where the cause of above report was already pointed out in the discussion for
the below report.

general protection fault in kernfs_kill_sb
https://syzkaller.appspot.com/bug?id=d7db6ecf34f099248e4ff404cd381a19a4075653

Since the latter is marked as "fixed on May 08 18:30", I worry that very few
users will check the relationship.

> Note also that a "workqueue lockup" can be caused by almost anything in the
> kernel, I think. This one for example is probably in the sound subsystem:
> https://syzkaller.appspot.com/text?tag=CrashReport&x=1767232b800000
>

Right. Maybe we should not stop the test upon a "workqueue lockup" message, because
it is likely that the cause of the lockup is that somebody is busy-looping, which
should shortly be reported as "rcu detected stall".

Of course, there is a possibility that "workqueue lockup" is reported because
cond_resched() was used where an explicit schedule_timeout_*() is required, which
was the reason commit 82607adcf9cdf40f ("workqueue: implement lockup detector")
was added.
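
As an illustration only (this code is not from the report): a work item that
polls in a loop and only calls cond_resched() never actually sleeps, so the
concurrency-managed worker pool cannot make progress on other queued work and
the lockup detector can still fire, whereas an explicit schedule_timeout_*()
really sleeps. A minimal kernel-style sketch, with a hypothetical
my_condition_met() helper standing in for whatever the work item is waiting on:

#include <linux/types.h>
#include <linux/workqueue.h>
#include <linux/sched.h>
#include <linux/jiffies.h>

/* Hypothetical placeholder for the condition being polled. */
static bool my_condition_met(void)
{
	return false;
}

/* Anti-pattern: cond_resched() may yield the CPU, but the worker never
 * sleeps, so other work items queued on the same pool stay pending and
 * the workqueue watchdog can report the pool as stuck. */
static void poll_work_bad(struct work_struct *work)
{
	while (!my_condition_met())
		cond_resched();
}

/* Better: actually sleep between polls; while the worker sleeps, the
 * pool can run other work items. */
static void poll_work_good(struct work_struct *work)
{
	while (!my_condition_met())
		schedule_timeout_uninterruptible(HZ / 10);
}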

If we do stop the test upon a "workqueue lockup" message, a longer timeout (e.g.
300 seconds) may be better, so that rcu stall or hung task messages are reported
if an rcu stall or hung task is actually occurring.

Dmitry Vyukov

unread,
May 13, 2018, 10:35:53 AM5/13/18
to Tetsuo Handa, Eric Biggers, syzbot, Peter Hurley, Greg Kroah-Hartman, Kate Stewart, LKML, Linux-MM, Philippe Ombredanne, syzkaller-bugs, Thomas Gleixner
Yes, we need to order the different stalls/lockups/hangs/etc. according to
what can trigger what. E.g. an rcu stall can trigger a task hung and a
workqueue lockup, but not the other way around.
There is https://github.com/google/syzkaller/issues/516 to track this.
But I have not yet had time to figure out all the required changes.
If you have additional details, please add them there.
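
To make the intent concrete, here is a hypothetical sketch (not syzkaller's
actual code) of such an ordering: hang-type reports are ranked by what can
trigger what, and when several of them show up in the same console log the
highest-ranked one is attributed as the primary crash. The title strings and
function names below are made up for illustration.

#include <stdio.h>
#include <string.h>

/* Causality order: an rcu stall can cause task hangs and workqueue
 * lockups, but not the other way around, so it ranks first. */
static const char *hang_order[] = {
	"rcu detected stall",
	"soft lockup",
	"task hung",
	"workqueue lockup",
};

#define NR_HANG_TYPES (sizeof(hang_order) / sizeof(hang_order[0]))

static int hang_rank(const char *title)
{
	unsigned int i;

	for (i = 0; i < NR_HANG_TYPES; i++)
		if (strcmp(title, hang_order[i]) == 0)
			return (int)i;
	return (int)NR_HANG_TYPES; /* unknown titles rank last */
}

/* Pick the report to attribute as the primary crash when several
 * hang-type reports appear together. */
static const char *pick_root_cause(const char **titles, int n)
{
	const char *best = NULL;
	int i, best_rank = (int)NR_HANG_TYPES + 1;

	for (i = 0; i < n; i++) {
		int r = hang_rank(titles[i]);

		if (r < best_rank) {
			best = titles[i];
			best_rank = r;
		}
	}
	return best;
}

int main(void)
{
	const char *seen[] = { "workqueue lockup", "rcu detected stall" };

	printf("%s\n", pick_root_cause(seen, 2)); /* rcu detected stall */
	return 0;
}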