net/bluetooth: workqueue destruction WARNING in hci_unregister_dev

92 views
Skip to first unread message

Dmitry Vyukov

unread,
Jan 26, 2016, 6:53:44 AM1/26/16
to Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-b...@vger.kernel.org, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet
Hello,

I've hit the following warning while running syzkaller fuzzer:

------------[ cut here ]------------
WARNING: CPU: 2 PID: 17409 at kernel/workqueue.c:3968
destroy_workqueue+0x172/0x550()
Modules linked in:
CPU: 2 PID: 17409 Comm: syz-executor Not tainted 4.5.0-rc1+ #283
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
00000000ffffffff ffff88003665f8a0 ffffffff8299a06d 0000000000000000
ffff88003599c740 ffffffff8643f0c0 ffff88003665f8e0 ffffffff8134fcf9
ffffffff8139d4c2 ffffffff8643f0c0 0000000000000f80 ffff8800630c5ae8
Call Trace:
[< inline >] __dump_stack lib/dump_stack.c:15
[<ffffffff8299a06d>] dump_stack+0x6f/0xa2 lib/dump_stack.c:50
[<ffffffff8134fcf9>] warn_slowpath_common+0xd9/0x140 kernel/panic.c:482
[<ffffffff8134ff29>] warn_slowpath_null+0x29/0x30 kernel/panic.c:515
[<ffffffff8139d4c2>] destroy_workqueue+0x172/0x550 kernel/workqueue.c:3968
[<ffffffff85a714a4>] hci_unregister_dev+0x264/0x700
net/bluetooth/hci_core.c:3162
[<ffffffff84595ce6>] vhci_release+0x76/0xe0 drivers/bluetooth/hci_vhci.c:341
[<ffffffff817b2376>] __fput+0x236/0x780 fs/file_table.c:208
[<ffffffff817b2945>] ____fput+0x15/0x20 fs/file_table.c:244
[<ffffffff813ad760>] task_work_run+0x170/0x210 kernel/task_work.c:115
[< inline >] exit_task_work include/linux/task_work.h:21
[<ffffffff81358da5>] do_exit+0x8b5/0x2c60 kernel/exit.c:748
[<ffffffff8135b2c8>] do_group_exit+0x108/0x330 kernel/exit.c:878
[<ffffffff8137e434>] get_signal+0x5e4/0x14f0 kernel/signal.c:2307
[<ffffffff811a1db3>] do_signal+0x83/0x1c90 arch/x86/kernel/signal.c:712
[<ffffffff81006685>] exit_to_usermode_loop+0x1a5/0x210
arch/x86/entry/common.c:247
[< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:282
[<ffffffff810084ea>] syscall_return_slowpath+0x2ba/0x340
arch/x86/entry/common.c:344
[<ffffffff863597a2>] int_ret_from_sys_call+0x25/0x9f
arch/x86/entry/entry_64.S:281
---[ end trace f627386faee7426f ]---

Unfortunately I cannot reproduce it in a controlled environment, but
I've hit it twice in different VMs. So maybe if you see something
obvious there. Is it possible that something is submitted into the
workqueue between it is drained and destroyed in hci_unregister_dev?

On commit 92e963f50fc74041b5e9e744c330dca48e04f08d (Jan 24).

Dmitry Vyukov

unread,
Jan 26, 2016, 7:29:42 AM1/26/16
to Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-b...@vger.kernel.org, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet
Wait, I was able to reproduce it by running the following program in a
parallel loop for several hours:
https://gist.githubusercontent.com/dvyukov/c15675af95e599fe6631/raw/c34117b54d0352f5df4f572d4151a94557780a9b/gistfile1.txt
I think that you just need to open /dev/vhci, because the program does
not seem to do anything useful with the descriptor.
Machine died, full log is here:
https://gist.githubusercontent.com/dvyukov/4584d70c2b4bee7ca875/raw/3794fb9f4c8e0fb2f91261a66725b042b5de4e0f/gistfile1.txt

Jiri Slaby

unread,
Feb 18, 2016, 9:00:40 AM2/18/16
to Dmitry Vyukov, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-b...@vger.kernel.org, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Tejun Heo, Takashi Iwai
Cc Tejun (workqueues), Takashi (debug patch)

On 01/26/2016, 12:53 PM, Dmitry Vyukov wrote:
> Hello,
>
> I've hit the following warning while running syzkaller fuzzer:

Hi,

I am hitting it over and over again using syzkaller.

> WARNING: CPU: 2 PID: 17409 at kernel/workqueue.c:3968
> destroy_workqueue+0x172/0x550()

Which of the warnings is it in your case?

I hit "(pwq != wq->dfl_pwq) && (pwq->refcnt > 1)". So I have this in the
code:
if (WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt > 1))) {
pr_info("%s: pwq=%p wq->dfl_pwq=%p pwq->refcnt=%d\n",
__func__, pwq, wq->dfl_pwq,
pwq->refcnt);
mutex_unlock(&wq->mutex);
return;
}

if (WARN_ON(pwq->nr_active)) {
...

And the values are:
pwq=ffff88006e271500 wq->dfl_pwq=ffff88006e272400 pwq->refcnt=2
pwq=ffff88002ec48300 wq->dfl_pwq=ffff88002ec4bc00 pwq->refcnt=2
pwq=ffff880020f76400 wq->dfl_pwq=ffff880020f75500 pwq->refcnt=2
pwq=ffff88002a908f00 wq->dfl_pwq=ffff88002a90bc00 pwq->refcnt=1
pwq=ffff8800348e0300 wq->dfl_pwq=ffff8800348e0000 pwq->refcnt=2
pwq=ffff88006e276400 wq->dfl_pwq=ffff88006e275800 pwq->refcnt=2


Note the single "pwq->refcnt=1" in there. So this is perhaps a race?

Takashi also provided me with a debug patch included in the attached
patch. It did not trigger though.
thanks,
--
js
suse labs
vhci_debug.patch

Dmitry Vyukov

unread,
Feb 18, 2016, 9:22:56 AM2/18/16
to Jiri Slaby, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-b...@vger.kernel.org, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Tejun Heo, Takashi Iwai
On Thu, Feb 18, 2016 at 3:00 PM, Jiri Slaby <jsl...@suse.cz> wrote:
> Cc Tejun (workqueues), Takashi (debug patch)
>
> On 01/26/2016, 12:53 PM, Dmitry Vyukov wrote:
>> Hello,
>>
>> I've hit the following warning while running syzkaller fuzzer:
>
> Hi,
>
> I am hitting it over and over again using syzkaller.
>
>> WARNING: CPU: 2 PID: 17409 at kernel/workqueue.c:3968
>> destroy_workqueue+0x172/0x550()
>
> Which of the warnings is it in your case?


The line number points to WARN_ON(pwq->nr_active), but there can well
be a off-by-one error, so I am not sure.
If you are hitting it all the time, so probably it is the same. Or at
least if makes sense to fix yours first since it is more easily
reproducible.

Tejun Heo

unread,
Feb 18, 2016, 12:44:30 PM2/18/16
to Dmitry Vyukov, Jiri Slaby, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-b...@vger.kernel.org, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Takashi Iwai
Hello,

Can you please do the followings?

1. Remove WQ_MEM_RECLAIM from the affected workqueue and see whether
the problem is reproducible. WQ_MEM_RECLAIM on anything bluetooth
doesn't make sense btw. Why is it there?

2. If WQ_MEM_RECLAIM makes the issue go away, see whether the attached
patch works too.

Thanks.

--
tejun
patch

Jiri Slaby

unread,
Feb 19, 2016, 5:20:21 AM2/19/16
to Tejun Heo, Dmitry Vyukov, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-b...@vger.kernel.org, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Takashi Iwai
Hello,

1. didn't help, the problem persists. So I haven't applied the patch from 2.

Jiri Slaby

unread,
Feb 19, 2016, 7:10:02 AM2/19/16
to Tejun Heo, Dmitry Vyukov, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-b...@vger.kernel.org, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Takashi Iwai
FWIW I dumped more info about the wq:
wq->name='hci0' pwq=ffff8800390d7600 wq->dfl_pwq=ffff8800390d5200
pwq->refcnt=2 pwq->nr_active=0 delayed_works: <nothing>

Tejun Heo

unread,
Mar 2, 2016, 10:45:09 AM3/2/16
to Jiri Slaby, Dmitry Vyukov, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-b...@vger.kernel.org, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Takashi Iwai
Hello, Jiri.

On Fri, Feb 19, 2016 at 01:10:00PM +0100, Jiri Slaby wrote:
> > 1. didn't help, the problem persists. So I haven't applied the patch from 2.
>
> FWIW I dumped more info about the wq:
> wq->name='hci0' pwq=ffff8800390d7600 wq->dfl_pwq=ffff8800390d5200
> pwq->refcnt=2 pwq->nr_active=0 delayed_works: <nothing>

Can you please print out the same info for all pwq's during shutdown?
It looks like we're leaking pwq refcnt but I can't spot a place where
that could happen on an empty pwq.

Thanks.

--
tejun

Jiri Slaby

unread,
Mar 3, 2016, 4:12:04 AM3/3/16
to Tejun Heo, Dmitry Vyukov, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-b...@vger.kernel.org, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Takashi Iwai
Hi,
I have not done that yet, but today, I see:
destroy_workqueue: name='req_hci0' pwq=ffff88002f590300
wq->dfl_pwq=ffff88002f591e00 pwq->refcnt=2 pwq->nr_active=0 delayed_works:
pwq 12: cpus=0-1 node=0 flags=0x4 nice=-20 active=0/1
in-flight: 18568:wq_barrier_func

Tejun Heo

unread,
Mar 11, 2016, 12:12:09 PM3/11/16
to Jiri Slaby, Dmitry Vyukov, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-b...@vger.kernel.org, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Takashi Iwai
Hello,

Sorry about the delay.
So, this means that there's flush_work() racing against workqueue
destruction, which can't be safe. :(

Thanks.

--
tejun

Jiri Slaby

unread,
Mar 17, 2016, 8:00:17 AM3/17/16
to Tejun Heo, Dmitry Vyukov, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-b...@vger.kernel.org, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Takashi Iwai
Hello,
But I cannot trigger the WARN_ONs in the attached patch, so I am
confused how this can happen :(. (While I am still seeing the destroy
WARNINGs.)

BTW. what did you mean by dumping the states at shutdown? Is it still
relevant?
vhci_debug.patch

Tejun Heo

unread,
Mar 18, 2016, 4:52:34 PM3/18/16
to Jiri Slaby, Dmitry Vyukov, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-b...@vger.kernel.org, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Takashi Iwai
Hello, Jiri.

On Thu, Mar 17, 2016 at 01:00:13PM +0100, Jiri Slaby wrote:
> >> I have not done that yet, but today, I see:
> >> destroy_workqueue: name='req_hci0' pwq=ffff88002f590300
> >> wq->dfl_pwq=ffff88002f591e00 pwq->refcnt=2 pwq->nr_active=0 delayed_works:
> >> pwq 12: cpus=0-1 node=0 flags=0x4 nice=-20 active=0/1
> >> in-flight: 18568:wq_barrier_func
> >
> > So, this means that there's flush_work() racing against workqueue
> > destruction, which can't be safe. :(
>
> But I cannot trigger the WARN_ONs in the attached patch, so I am
> confused how this can happen :(. (While I am still seeing the destroy
> WARNINGs.)

So, no operations should be in progress when destroy_workqueue() is
called. If somebody was flushing a work item, the flush call must
have returned before destroy_workqueue() was invoked, which doesn't
seem to be the case here. Can you trigger BUG_ON() or sysrq-t when
the above triggers? There must be a task which is flushing a work
item there and it shouldn't be difficult to pinpoint what's going on
from it.

Thanks.

--
tejun

Jiri Slaby

unread,
Mar 21, 2016, 11:58:24 AM3/21/16
to Tejun Heo, Dmitry Vyukov, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-b...@vger.kernel.org, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Takashi Iwai
Hello,
The output of sysrq-t is here (> 200k), but I cannot see anything
suspicious in it:
http://www.fi.muni.cz/~xslaby/sklad/panics/jctl.txt

This is what the code does now:
+ if ((pwq != wq->dfl_pwq) && (pwq->refcnt > 1)) {
+ pr_info("%s: name='%s' pwq=%p wq->dfl_pwq=%p
pwq->refcnt=%d pwq->nr_active=%d delayed_works:",
+ __func__, wq->name, pwq,
wq->dfl_pwq,
+ pwq->refcnt, pwq->nr_active);
+
+ show_pwq(pwq);
+
+ mutex_unlock(&wq->mutex);
+ show_state();
+ show_workqueue_state();
+ WARN_ON(1);
+ return;
+ }

Jiri Slaby

unread,
Mar 22, 2016, 4:09:53 AM3/22/16
to Tejun Heo, Dmitry Vyukov, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-b...@vger.kernel.org, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Takashi Iwai
On 03/21/2016, 04:58 PM, Jiri Slaby wrote:
> Hello,
>
> On 03/18/2016, 09:52 PM, Tejun Heo wrote:
>> On Thu, Mar 17, 2016 at 01:00:13PM +0100, Jiri Slaby wrote:
>>>>> I have not done that yet, but today, I see:
>>>>> destroy_workqueue: name='req_hci0' pwq=ffff88002f590300
>>>>> wq->dfl_pwq=ffff88002f591e00 pwq->refcnt=2 pwq->nr_active=0 delayed_works:
>>>>> pwq 12: cpus=0-1 node=0 flags=0x4 nice=-20 active=0/1
>>>>> in-flight: 18568:wq_barrier_func
>>>>
>>>> So, this means that there's flush_work() racing against workqueue
>>>> destruction, which can't be safe. :(
>>>
>>> But I cannot trigger the WARN_ONs in the attached patch, so I am
>>> confused how this can happen :(. (While I am still seeing the destroy
>>> WARNINGs.)
>>
>> So, no operations should be in progress when destroy_workqueue() is
>> called. If somebody was flushing a work item, the flush call must
>> have returned before destroy_workqueue() was invoked, which doesn't
>> seem to be the case here. Can you trigger BUG_ON() or sysrq-t when
>> the above triggers? There must be a task which is flushing a work
>> item there and it shouldn't be difficult to pinpoint what's going on
>> from it.
>
> The output of sysrq-t is here (> 200k), but I cannot see anything
> suspicious in it:
> http://www.fi.muni.cz/~xslaby/sklad/panics/jctl.txt

Hmm, so I seem I cannot reproduce with this hunk:
--- a/net/bluetooth/hci_core.c
+++ b/net/bluetooth/hci_core.c
@@ -3139,10 +3139,10 @@ void hci_unregister_dev(struct hci_dev *hdev)
list_del(&hdev->list);
write_unlock(&hci_dev_list_lock);

- hci_dev_do_close(hdev);
-
cancel_work_sync(&hdev->power_on);

+ hci_dev_do_close(hdev);
+
if (!test_bit(HCI_INIT, &hdev->flags) &&
!hci_dev_test_flag(hdev, HCI_SETUP) &&
!hci_dev_test_flag(hdev, HCI_CONFIG)) {



I cannot explain why though. I do not see how it matters in this
particular case...

Dmitry, could you apply it too? But I don't know how often you see the
warning.

PS. next on the table is the gsm tty warning.

Dmitry Vyukov

unread,
Mar 22, 2016, 8:32:30 AM3/22/16
to Jiri Slaby, Tejun Heo, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Takashi Iwai
I've seen it only several times in several months, so I don't it will
be helpful.

Dmitry Vyukov

unread,
Sep 3, 2016, 6:58:55 AM9/3/16
to Jiri Slaby, Tejun Heo, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Takashi Iwai
Bad news: I hit it again.
On 0f98f121e1670eaa2a2fbb675e07d6ba7f0e146f of linux-next, so I have
bf389cabb3b8079c23f9762e62b05f291e2d5e99.

WARNING: CPU: 3 PID: 23168 at kernel/workqueue.c:4029
destroy_workqueue+0x172/0x620
CPU: 3 PID: 23168 Comm: syz-executor Not tainted 4.8.0-rc3-next-20160825+ #8
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
ffffffff886b6fe0 ffff880037be77a8 ffffffff82db38d9 ffffffff00000016
fffffbfff10d6dfc ffffffff8708f000 ffff880037be7880 ffffffff8709be00
dffffc0000000000 0000000000000009 ffff880037be7870 ffffffff816cf888
Call Trace:
[< inline >] __dump_stack lib/dump_stack.c:15
[<ffffffff82db38d9>] dump_stack+0x12e/0x185 lib/dump_stack.c:51
[<ffffffff816cf888>] panic+0x1e4/0x3fb kernel/panic.c:179
[<ffffffff813863b4>] __warn+0x1c4/0x1e0 kernel/panic.c:542
[<ffffffff8138659c>] warn_slowpath_null+0x2c/0x40 kernel/panic.c:585
[<ffffffff813dc6d2>] destroy_workqueue+0x172/0x620 kernel/workqueue.c:4029
[<ffffffff8645973f>] hci_unregister_dev+0x3df/0x8a0
net/bluetooth/hci_core.c:3170
[<ffffffff84cb6926>] vhci_release+0x76/0xf0 drivers/bluetooth/hci_vhci.c:354
[<ffffffff81868bbc>] __fput+0x28c/0x780 fs/file_table.c:208
[<ffffffff81869135>] ____fput+0x15/0x20 fs/file_table.c:244
[<ffffffff813ebf63>] task_work_run+0xf3/0x170
[< inline >] exit_task_work include/linux/task_work.h:21
[<ffffffff81394218>] do_exit+0x868/0x2e70 kernel/exit.c:828
[<ffffffff81396998>] do_group_exit+0x108/0x330 kernel/exit.c:958
[<ffffffff813ba4aa>] get_signal+0x62a/0x15d0 kernel/signal.c:2307
[<ffffffff81202333>] do_signal+0x83/0x1f60 arch/x86/kernel/signal.c:805
[<ffffffff81006345>] exit_to_usermode_loop+0x1a5/0x210
arch/x86/entry/common.c:163
[< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:198
[<ffffffff8100869f>] syscall_return_slowpath+0x2bf/0x340
arch/x86/entry/common.c:267
[<ffffffff86e1079c>] entry_SYSCALL_64_fastpath+0xbf/0xc1

Tejun Heo

unread,
Sep 5, 2016, 9:08:40 AM9/5/16
to Dmitry Vyukov, Jiri Slaby, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Takashi Iwai
Hello,

On Sat, Sep 03, 2016 at 12:58:33PM +0200, Dmitry Vyukov wrote:
> > I've seen it only several times in several months, so I don't it will
> > be helpful.
>
>
> Bad news: I hit it again.
> On 0f98f121e1670eaa2a2fbb675e07d6ba7f0e146f of linux-next, so I have
> bf389cabb3b8079c23f9762e62b05f291e2d5e99.

Hmmm... we're not getting anywhere. I've applied the following patch
to wq/for-4.8-fixes so that when this happens the next time we can
actually tell what the hell is going wrong.

Thanks.

------ 8< ------
From 278930ada88c972d20025b0f20def27b1a09dff7 Mon Sep 17 00:00:00 2001
From: Tejun Heo <t...@kernel.org>
Date: Mon, 5 Sep 2016 08:54:06 -0400
Subject: [PATCH] workqueue: dump workqueue state on sanity check failures in
destroy_workqueue()

destroy_workqueue() performs a number of sanity checks to ensure that
the workqueue is empty before proceeding with destruction. However,
it's not always easy to tell what's going on just from the warning
message. Let's dump workqueue state after sanity check failures to
help debugging.

Signed-off-by: Tejun Heo <t...@kernel.org>
Link: http://lkml.kernel.org/r/CACT4Y+Zs6vkjHo9qHb4TrEiz...@mail.gmail.com
Cc: Dmitry Vyukov <dvy...@google.com>
---
kernel/workqueue.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ef071ca..4eaec8b8 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4021,6 +4021,7 @@ void destroy_workqueue(struct workqueue_struct *wq)
for (i = 0; i < WORK_NR_COLORS; i++) {
if (WARN_ON(pwq->nr_in_flight[i])) {
mutex_unlock(&wq->mutex);
+ show_workqueue_state();
return;
}
}
@@ -4029,6 +4030,7 @@ void destroy_workqueue(struct workqueue_struct *wq)
WARN_ON(pwq->nr_active) ||
WARN_ON(!list_empty(&pwq->delayed_works))) {
mutex_unlock(&wq->mutex);
+ show_workqueue_state();
return;
}
}
--
2.7.4

Dmitry Vyukov

unread,
Sep 5, 2016, 9:14:50 AM9/5/16
to Tejun Heo, Jiri Slaby, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Eric Dumazet, Takashi Iwai
Applied this to my test tree.

Dmitry Vyukov

unread,
Sep 10, 2016, 5:34:10 AM9/10/16
to Tejun Heo, Jiri Slaby, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Takashi Iwai
Hit the WARNING with the patch. It showed "Showing busy workqueues and
worker pools:" after the WARNING, but then no queue info. Was it
already destroyed and removed from the list?...



[ 186.994195] blk_update_request: critical target error, dev sr0, sector 0
[ 187.031368] sr 1:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[ 187.032172] sr 1:0:0:0: [sr0] tag#0 Sense Key : Illegal Request [current]
[ 187.032816] sr 1:0:0:0: [sr0] tag#0 Add. Sense: Invalid command operation code
[ 187.033432] sr 1:0:0:0: [sr0] tag#0 CDB: Write(10) 2a 00 00 00 00 00
00 00 02 00
[ 187.034124] blk_update_request: critical target error, dev sr0, sector 0
[ 187.069408] sr 1:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[ 187.070127] sr 1:0:0:0: [sr0] tag#0 Sense Key : Illegal Request [current]
[ 187.070744] sr 1:0:0:0: [sr0] tag#0 Add. Sense: Invalid command operation code
[ 187.071368] sr 1:0:0:0: [sr0] tag#0 CDB: Write(10) 2a 00 00 00 00 00
00 00 04 00
[ 187.071982] blk_update_request: critical target error, dev sr0, sector 0
[ 187.098397] sr 1:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[ 187.099194] sr 1:0:0:0: [sr0] tag#0 Sense Key : Illegal Request [current]
[ 187.099863] sr 1:0:0:0: [sr0] tag#0 Add. Sense: Invalid command operation code
[ 187.100527] sr 1:0:0:0: [sr0] tag#0 CDB: Write(10) 2a 00 00 00 00 00
00 00 04 00
[ 187.101206] blk_update_request: critical target error, dev sr0, sector 0
[ 187.104486] VFS: Dirty inode writeback failed for block device sr0 (err=-5).
[ 198.113128] ------------[ cut here ]------------
[ 198.113838] WARNING: CPU: 2 PID: 26691 at kernel/workqueue.c:4042
destroy_workqueue+0x17b/0x630
[ 198.114628] Modules linked in:
[ 198.115009] CPU: 2 PID: 26691 Comm: syz-executor Not tainted
4.8.0-rc5-next-20160909+ #23
[ 198.115615] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[ 198.116032] ffff880060f4f880 ffffffff82d88e39 ffffffff87098a20
fffffbfff10d6ad0
[ 198.116032] 0000000000000000 0000000000000000 ffffffff87098a20
ffffffff813b5abb
[ 198.116032] 0000000000000009 ffff880060f4f8c8 ffffffff81354c3f
ffff880060f4f8e8
[ 198.116032] Call Trace:
[ 198.116032] [<ffffffff82d88e39>] dump_stack+0x12e/0x185
[ 198.116032] [<ffffffff813b5abb>] ? destroy_workqueue+0x17b/0x630
[ 198.116032] [<ffffffff81354c3f>] __warn+0x19f/0x1e0
[ 198.116032] [<ffffffff81354e4c>] warn_slowpath_null+0x2c/0x40
[ 198.116032] [<ffffffff813b5abb>] destroy_workqueue+0x17b/0x630
[ 198.116032] [<ffffffff8642b43f>] hci_unregister_dev+0x3df/0x8a0
[ 198.116032] [<ffffffff8195bc70>] ? fcntl_setlk+0xc30/0xc30
[ 198.116032] [<ffffffff84c86b70>] ? vhci_flush+0x50/0x50
[ 198.116032] [<ffffffff84c86be6>] vhci_release+0x76/0xf0
[ 198.116032] [<ffffffff8183c71c>] __fput+0x28c/0x780
[ 198.116032] [<ffffffff8183cc95>] ____fput+0x15/0x20
[ 198.116032] [<ffffffff813bc6d3>] task_work_run+0xf3/0x170
[ 198.116032] [<ffffffff81364c58>] do_exit+0x868/0x2de0
[ 198.116032] [<ffffffff81934de0>] ? userfaultfd_ioctl+0x27b0/0x27b0
[ 198.116032] [<ffffffff8146c180>] ? debug_check_no_locks_freed+0x3c0/0x3c0
[ 198.116032] [<ffffffff813643f0>] ? mm_update_next_owner+0x640/0x640
[ 198.116032] [<ffffffff819323d0>] ? userfaultfd_release+0x620/0x620
[ 198.116032] [<ffffffff8137fd32>] ? __dequeue_signal+0x132/0x540
[ 198.116032] [<ffffffff8137f896>] ? recalc_sigpending_tsk+0xd6/0x180
[ 198.116032] [<ffffffff81367348>] do_group_exit+0x108/0x330
[ 198.116032] [<ffffffff8138adea>] get_signal+0x62a/0x15d0
[ 198.116032] [<ffffffff811d0333>] do_signal+0x83/0x1e20
[ 198.116032] [<ffffffff81761920>] ? __pmd_alloc+0x3f0/0x3f0
[ 198.116032] [<ffffffff811d02b0>] ? setup_sigcontext+0x7d0/0x7d0
[ 198.116032] [<ffffffff8125d3c5>] ? kvm_clock_get_cycles+0x25/0x30
[ 198.116032] [<ffffffff817720d0>] ? find_vma+0x30/0x150
[ 198.116032] [<ffffffff8126b128>] ? __do_page_fault+0x198/0xbb0
[ 198.116032] [<ffffffff81510500>] ? do_futex+0x17d0/0x17d0
[ 198.116032] [<ffffffff81006335>] exit_to_usermode_loop+0x1a5/0x210
[ 198.116032] [<ffffffff81008384>] prepare_exit_to_usermode+0x154/0x190
[ 198.116032] [<ffffffff86df20b4>] retint_user+0x8/0x23
[ 198.137668] ---[ end trace 1beeef55c15054c4 ]---
[ 198.138229] Showing busy workqueues and worker pools:
[ 198.219611] sr 1:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[ 198.220424] sr 1:0:0:0: [sr0] tag#0 Sense Key : Illegal Request [current]
[ 198.221207] sr 1:0:0:0: [sr0] tag#0 Add. Sense: Invalid command operation code
[ 198.222086] sr 1:0:0:0: [sr0] tag#0 CDB: Write(10) 2a 00 00 00 00 00
00 00 04 00
[ 198.222872] blk_update_request: critical target error, dev sr0, sector 0
[ 198.223637] buffer_io_error: 78 callbacks suppressed
[ 198.224152] Buffer I/O error on dev sr0, logical block 0, lost async
page write
[ 198.225336] Buffer I/O error on dev sr0, logical block 1, lost async
page write
[ 198.226211] Buffer I/O error on dev sr0, logical block 2, lost async
page write
[ 198.227031] Buffer I/O error on dev sr0, logical block 3, lost async
page write
[ 198.227887] Buffer I/O error on dev sr0, logical block 4, lost async
page write
[ 198.228841] Buffer I/O error on dev sr0, logical block 5, lost async
page write
[ 198.229653] Buffer I/O error on dev sr0, logical block 6, lost async
page write
[ 198.230344] Buffer I/O error on dev sr0, logical block 7, lost async
page write
[ 198.231156] Buffer I/O error on dev sr0, logical block 8, lost async
page write
[ 198.231973] Buffer I/O error on dev sr0, logical block 9, lost async
page write
[ 198.244460] sr 1:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[ 198.245272] sr 1:0:0:0: [sr0] tag#0 Sense Key : Illegal Request [current]
[ 198.245939] sr 1:0:0:0: [sr0] tag#0 Add. Sense: Invalid command operation code
[ 198.246554] sr 1:0:0:0: [sr0] tag#0 CDB: Write(10) 2a 00 00 00 00 00
00 00 02 00
[ 198.247265] blk_update_request: critical target error, dev sr0, sector 0
[ 198.275421] sr 1:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[ 198.276249] sr 1:0:0:0: [sr0] tag#0 Sense Key : Illegal Request [current]
[ 198.276865] sr 1:0:0:0: [sr0] tag#0 Add. Sense: Invalid command operation code
[ 198.277466] sr 1:0:0:0: [sr0] tag#0 CDB: Write(10) 2a 00 00 00 00 00
00 00 04 00
[ 198.277998] blk_update_request: critical target error, dev sr0, sector 0
[ 198.333417] sr 1:0:0:0: [sr0] tag#0 FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[ 198.334161] sr 1:0:0:0: [sr0] tag#0 Sense Key : Illegal Request [current]
[ 198.334734] sr 1:0:0:0: [sr0] tag#0 Add. Sense: Invalid command operation code
[ 198.335433] sr 1:0:0:0: [sr0] tag#0 CDB: Write(10) 2a 00 00 00 00 00
00 00 04 00
[ 198.336095] blk_update_request: critical target error, dev sr0, sector 0
[ 198.337116] VFS: Dirty inode writeback failed for block device sr0 (err=-5).

Tejun Heo

unread,
Sep 13, 2016, 11:35:21 AM9/13/16
to Dmitry Vyukov, Jiri Slaby, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Takashi Iwai
Hello,

On Sat, Sep 10, 2016 at 11:33:48AM +0200, Dmitry Vyukov wrote:
> Hit the WARNING with the patch. It showed "Showing busy workqueues and
> worker pools:" after the WARNING, but then no queue info. Was it
> already destroyed and removed from the list?...

Hmm... It either means that the work item which was in flight when
WARN_ON() ran finished by the time the debug printout got to it or
that it's something unrelated to busy work items.

> [ 198.113838] WARNING: CPU: 2 PID: 26691 at kernel/workqueue.c:4042
> destroy_workqueue+0x17b/0x630

I don't seem to have the same source code that you have. Which exact
WARN_ON() is this?

Thanks.

--
tejun

Jiri Slaby

unread,
Sep 13, 2016, 2:14:43 PM9/13/16
to Tejun Heo, Dmitry Vyukov, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Takashi Iwai
I assume Dmitry sees the same what I am still seeing, so I reported this
some time ago:
https://lkml.org/lkml/2016/3/21/492

This warning is trigerred there and still occurs with "HEAD":
(pwq != wq->dfl_pwq) && (pwq->refcnt > 1)
and the state dump is in the log empty too:
destroy_workqueue: name='hci0' pwq=ffff88006b5c8f00
wq->dfl_pwq=ffff88006b5c9b00 pwq->refcnt=2 pwq->nr_active=0 delayed_works:
pwq 13:
cpus=2-3 node=1 flags=0x4 nice=-20 active=0/1
in-flight: 2669:wq_barrier_func

Tejun Heo

unread,
Sep 16, 2016, 4:24:06 PM9/16/16
to Jiri Slaby, Dmitry Vyukov, Marcel Holtmann, Gustavo Padovan, Johan Hedberg, David S. Miller, linux-bluetooth, netdev, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Takashi Iwai
Hello,

On Tue, Sep 13, 2016 at 08:14:40PM +0200, Jiri Slaby wrote:
> I assume Dmitry sees the same what I am still seeing, so I reported this
> some time ago:
> https://lkml.org/lkml/2016/3/21/492
>
> This warning is trigerred there and still occurs with "HEAD":
> (pwq != wq->dfl_pwq) && (pwq->refcnt > 1)
> and the state dump is in the log empty too:
> destroy_workqueue: name='hci0' pwq=ffff88006b5c8f00
> wq->dfl_pwq=ffff88006b5c9b00 pwq->refcnt=2 pwq->nr_active=0 delayed_works:
> pwq 13:
> cpus=2-3 node=1 flags=0x4 nice=-20 active=0/1
> in-flight: 2669:wq_barrier_func

Hmmm... I think it could be from rescuer holding reference on the pwq.
Both cases have WQ_MEM_RECLAIM and it could be that rescuer was still
in flight (even without work items pending) when the sanity checks
were done. The following patch moves the sanity checks after rescuer
destruction. Dmitry, Jiri, can you please see whether the warning
goes away with this patch?

Thanks.

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 984f6ff..e8046a1 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4042,8 +4042,7 @@ void destroy_workqueue(struct workqueue_struct *wq)
}
}

- if (WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt > 1)) ||
- WARN_ON(pwq->nr_active) ||
+ if (WARN_ON(pwq->nr_active) ||
WARN_ON(!list_empty(&pwq->delayed_works))) {
mutex_unlock(&wq->mutex);
show_workqueue_state();
@@ -4080,6 +4079,7 @@ void destroy_workqueue(struct workqueue_struct *wq)
for_each_node(node) {
pwq = rcu_access_pointer(wq->numa_pwq_tbl[node]);
RCU_INIT_POINTER(wq->numa_pwq_tbl[node], NULL);
+ WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt != 1));
put_pwq_unlocked(pwq);
}

@@ -4089,6 +4089,7 @@ void destroy_workqueue(struct workqueue_struct *wq)
*/
pwq = wq->dfl_pwq;
wq->dfl_pwq = NULL;
+ WARN_ON(pwq->refcnt != 1);
put_pwq_unlocked(pwq);
}
}
Reply all
Reply to author
Forward
0 new messages