[syzbot] upstream test error: WARNING in __queue_work


syzbot

Aug 29, 2022, 10:07:37 PM
to jiangs...@gmail.com, linux-...@vger.kernel.org, syzkall...@googlegroups.com, t...@kernel.org
Hello,

syzbot found the following issue on:

HEAD commit: 4c612826bec1 Merge tag 'net-6.0-rc3' of git://git.kernel.o..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=120ebce7080000
kernel config: https://syzkaller.appspot.com/x/.config?x=312be25752c7fe30
dashboard link: https://syzkaller.appspot.com/bug?extid=243b7d89777f90f7613b
compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+243b7d...@syzkaller.appspotmail.com

Bluetooth: hci0: command 0x0409 tx timeout
------------[ cut here ]------------
WARNING: CPU: 0 PID: 52 at kernel/workqueue.c:1438 __queue_work+0xe3f/0x1210 kernel/workqueue.c:1438
Modules linked in:
CPU: 0 PID: 52 Comm: kworker/0:2 Not tainted 6.0.0-rc2-syzkaller-00159-g4c612826bec1 #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
Workqueue: events hci_cmd_timeout
RIP: 0010:__queue_work+0xe3f/0x1210 kernel/workqueue.c:1438
Code: e0 07 83 c0 03 38 d0 7c 09 84 d2 74 05 e8 29 09 79 00 8b 5b 2c 31 ff 83 e3 20 89 de e8 9a 5f 2d 00 85 db 75 42 e8 d1 62 2d 00 <0f> 0b e9 41 f8 ff ff e8 c5 62 2d 00 0f 0b e9 d3 f7 ff ff e8 b9 62
RSP: 0018:ffffc90000947c60 EFLAGS: 00010093
RAX: 0000000000000000 RBX: ffff88802c83e200 RCX: 0000000000000000
RDX: ffff88801538a180 RSI: ffffffff814dd75f RDI: ffff88802c83e208
RBP: 0000000000000008 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000000200000 R11: 0000000000000000 R12: ffff8880266b4c70
R13: 0000000000000000 R14: ffff888014b1e000 R15: ffff888014b1e000
FS: 0000000000000000(0000) GS:ffff88802c800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000c0003d1e80 CR3: 00000000155b2000 CR4: 0000000000150ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
queue_work_on+0xee/0x110 kernel/workqueue.c:1545
process_one_work+0x991/0x1610 kernel/workqueue.c:2289
worker_thread+0x665/0x1080 kernel/workqueue.c:2436
kthread+0x2e4/0x3a0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306
</TASK>


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzk...@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

Lai Jiangshan

Aug 30, 2022, 10:08:51 AM
to syzbot, LKML, syzkall...@googlegroups.com, Tejun Heo, Marcel Holtmann, Johan Hedberg, Luiz Augusto von Dentz, linux-b...@vger.kernel.org
CC: BLUETOOTH SUBSYSTEM

It seems that hci_cmd_timeout() queues a work to a destroyed workqueue.
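
For reference, the WARN_ON_ONCE at kernel/workqueue.c:1438 is the chained-queueing check in __queue_work(); a simplified excerpt from the v6.0-rc2 sources (trimmed to the relevant lines):

	/*
	 * While a workqueue is draining (destroy_workqueue() drains it
	 * first), only "chained" work -- work queued by a worker already
	 * running on that same workqueue -- may be queued; anything else
	 * trips the warning seen in this report.
	 */
	if (unlikely(wq->flags & __WQ_DRAINING) &&
	    WARN_ON_ONCE(!is_chained_work(wq)))
		return;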

Luiz Augusto von Dentz

Aug 30, 2022, 1:37:21 PM
to Lai Jiangshan, syzbot, LKML, syzkall...@googlegroups.com, Tejun Heo, Marcel Holtmann, Johan Hedberg, linux-b...@vger.kernel.org
Hi Lai,

On Tue, Aug 30, 2022 at 7:08 AM Lai Jiangshan <jiangs...@gmail.com> wrote:
>
> CC: BLUETOOTH SUBSYSTEM
>
> It seems that hci_cmd_timeout() queues a work to a destroyed workqueue.

Are there any traces or a way to reproduce the problem?

--
Luiz Augusto von Dentz

Tetsuo Handa

Sep 2, 2022, 7:24:02 AM
to Marcel Holtmann, Johan Hedberg, Luiz Augusto von Dentz, Schspa Shi, syzbot, syzkall...@googlegroups.com, jiangs...@gmail.com, t...@kernel.org, linux-b...@vger.kernel.org
syzbot is reporting an attempt to schedule hdev->cmd_work work from system_wq
WQ into hdev->workqueue WQ, which is under a draining operation [1], for
commit c8efcc2589464ac7 ("workqueue: allow chained queueing during
destruction") does not allow such an operation.

The check introduced by commit 877afadad2dce8aa ("Bluetooth: When HCI work
queue is drained, only queue chained work") was incomplete.

Use hdev->workqueue WQ when queuing the hdev->{cmd,ncmd}_timer works, because
hci_{cmd,ncmd}_timeout() calls queue_work(hdev->workqueue). Also, protect
the queuing operation with an RCU read lock in order to avoid calling
queue_delayed_work() after cancel_delayed_work() has completed.

Link: https://syzkaller.appspot.com/bug?extid=243b7d89777f90f7613b [1]
Reported-by: syzbot <syzbot+243b7d...@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin...@I-love.SAKURA.ne.jp>
Fixes: 877afadad2dce8aa ("Bluetooth: When HCI work queue is drained, only queue chained work")
---
This is a difficult-to-trigger race condition, and therefore a reproducer is
not available. Please do a logical check in addition to automated testing.

net/bluetooth/hci_core.c | 15 +++++++++++++--
net/bluetooth/hci_event.c | 6 ++++--
2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/net/bluetooth/hci_core.c b/net/bluetooth/hci_core.c
index b3a5a3cc9372..9873d2e67988 100644
--- a/net/bluetooth/hci_core.c
+++ b/net/bluetooth/hci_core.c
@@ -597,6 +597,15 @@ static int hci_dev_do_reset(struct hci_dev *hdev)
 
 	/* Cancel these to avoid queueing non-chained pending work */
 	hci_dev_set_flag(hdev, HCI_CMD_DRAIN_WORKQUEUE);
+	/* Wait for
+	 *
+	 *    if (!hci_dev_test_flag(hdev, HCI_CMD_DRAIN_WORKQUEUE))
+	 *        queue_delayed_work(&hdev->{cmd,ncmd}_timer)
+	 *
+	 * inside RCU section to see the flag or complete scheduling.
+	 */
+	synchronize_rcu();
+	/* Explicitly cancel works in case scheduled after setting the flag. */
 	cancel_delayed_work(&hdev->cmd_timer);
 	cancel_delayed_work(&hdev->ncmd_timer);
 
@@ -4056,12 +4065,14 @@ static void hci_cmd_work(struct work_struct *work)
 			if (res < 0)
 				__hci_cmd_sync_cancel(hdev, -res);
 
+			rcu_read_lock();
 			if (test_bit(HCI_RESET, &hdev->flags) ||
 			    hci_dev_test_flag(hdev, HCI_CMD_DRAIN_WORKQUEUE))
 				cancel_delayed_work(&hdev->cmd_timer);
 			else
-				schedule_delayed_work(&hdev->cmd_timer,
-						      HCI_CMD_TIMEOUT);
+				queue_delayed_work(hdev->workqueue, &hdev->cmd_timer,
+						   HCI_CMD_TIMEOUT);
+			rcu_read_unlock();
 		} else {
 			skb_queue_head(&hdev->cmd_q, skb);
 			queue_work(hdev->workqueue, &hdev->cmd_work);
diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c
index 6643c9c20fa4..d6f0e6ca0e7e 100644
--- a/net/bluetooth/hci_event.c
+++ b/net/bluetooth/hci_event.c
@@ -3766,16 +3766,18 @@ static inline void handle_cmd_cnt_and_timer(struct hci_dev *hdev, u8 ncmd)
 {
 	cancel_delayed_work(&hdev->cmd_timer);
 
+	rcu_read_lock();
 	if (!test_bit(HCI_RESET, &hdev->flags)) {
 		if (ncmd) {
 			cancel_delayed_work(&hdev->ncmd_timer);
 			atomic_set(&hdev->cmd_cnt, 1);
 		} else {
 			if (!hci_dev_test_flag(hdev, HCI_CMD_DRAIN_WORKQUEUE))
-				schedule_delayed_work(&hdev->ncmd_timer,
-						      HCI_NCMD_TIMEOUT);
+				queue_delayed_work(hdev->workqueue, &hdev->ncmd_timer,
+						   HCI_NCMD_TIMEOUT);
 		}
 	}
+	rcu_read_unlock();
 }

static u8 hci_cc_le_read_buffer_size_v2(struct hci_dev *hdev, void *data,
--
2.18.4

Aleksandr Nogikh

Sep 2, 2022, 8:29:04 AM
to Luiz Augusto von Dentz, Lai Jiangshan, syzbot, LKML, 'Aleksandr Nogikh' via syzkaller-bugs, Tejun Heo, Marcel Holtmann, Johan Hedberg, linux-b...@vger.kernel.org
Hi,

This one has so far happened only once on syzbot, so it's probably either
an extremely rare issue or one that was already solved.
You can take a look at the console log provided in the original bug report:

console output: https://syzkaller.appspot.com/x/log.txt?x=120ebce7080000

Re. reproduction -- syzbot records a test error when it fails to complete
the following sequence of steps:
1) Boot a VM and establish an SSH connection to it
2) Upload fuzzer binaries
3) Start fuzzer binaries; these binaries will set up the fuzzing
environment (networking devices, etc)
4) Execute a simple mmap program to check if coverage collection works fine

mmap(0x1ffff000, 0x1000, 0x0, 0x32, 0xffffffffffffffff, 0x0)
mmap(0x20000000, 0x1000000, 0x7, 0x32, 0xffffffffffffffff, 0x0)
mmap(0x21000000, 0x1000, 0x0, 0x32, 0xffffffffffffffff, 0x0)
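
Decoded into symbolic constants, that sequence corresponds to roughly the
following C (a sketch: flag values per the x86-64 uapi headers, and the
guard-page reading of the two PROT_NONE mappings is my inference, not
something the test itself states):

	#include <sys/mman.h>

	int main(void)
	{
		/* 0x32 == MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, fd == -1 */
		mmap((void *)0x1ffff000, 0x1000, PROT_NONE,
		     MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);   /* guard page below */
		/* 0x7 == PROT_READ | PROT_WRITE | PROT_EXEC */
		mmap((void *)0x20000000, 0x1000000, PROT_READ | PROT_WRITE | PROT_EXEC,
		     MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);   /* 16 MiB RWX data area */
		mmap((void *)0x21000000, 0x1000, PROT_NONE,
		     MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);   /* guard page above */
		return 0;
	}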

It's probably easiest to start syzkaller locally on this exact kernel
revision and see whether fuzzing is able to start. It will perform the
same steps and report an error if the issue persists.
I've just tried to reproduce this particular bug myself on
4c612826bec1 and everything booted absolutely fine, so it was probably
just a flake.
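
If you do try that, the only setup needed beyond building the kernel is a
small syz-manager config; a minimal sketch (every path below is a
placeholder for your local checkout, image, and key -- see
https://github.com/google/syzkaller/blob/master/docs/linux/setup.md):

	{
		"target": "linux/amd64",
		"http": "127.0.0.1:56741",
		"workdir": "/path/to/workdir",
		"kernel_obj": "/path/to/linux",
		"image": "/path/to/image.img",
		"sshkey": "/path/to/image.id_rsa",
		"syzkaller": "/path/to/syzkaller",
		"procs": 8,
		"type": "qemu",
		"vm": {
			"count": 4,
			"kernel": "/path/to/linux/arch/x86/boot/bzImage",
			"cpu": 2,
			"mem": 2048
		}
	}

Then ./bin/syz-manager -config=manager.cfg performs exactly the
boot/SSH/upload/mmap-check sequence described above.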

FWIW syzbot can also perform patch testing for the reported bugs and
output console logs, so it should also simplify the debugging of such
bugs. More details are here:
https://github.com/google/syzkaller/blob/master/docs/syzbot.md#testing-patches

Patch testing can be done if there's a repro; I've just sent a PR
(https://github.com/google/syzkaller/pull/3355) to add testing to the
exception list -- we can then retest that without a repro.

Best Regards,
Aleksandr
>
> --
> Luiz Augusto von Dentz
>

Luiz Augusto von Dentz

Sep 2, 2022, 2:45:46 PM
to Tetsuo Handa, Marcel Holtmann, Johan Hedberg, Schspa Shi, syzbot, syzkall...@googlegroups.com, Lai Jiangshan, Tejun Heo, linux-b...@vger.kernel.org
Hi Tetsuo,

On Fri, Sep 2, 2022 at 4:23 AM Tetsuo Handa
<penguin...@i-love.sakura.ne.jp> wrote:
>
> syzbot is reporting an attempt to schedule hdev->cmd_work work from system_wq
> WQ into hdev->workqueue WQ, which is under a draining operation [1], for
> commit c8efcc2589464ac7 ("workqueue: allow chained queueing during
> destruction") does not allow such an operation.
>
> The check introduced by commit 877afadad2dce8aa ("Bluetooth: When HCI work
> queue is drained, only queue chained work") was incomplete.
>
> Use hdev->workqueue WQ when queuing the hdev->{cmd,ncmd}_timer works, because
> hci_{cmd,ncmd}_timeout() calls queue_work(hdev->workqueue). Also, protect
> the queuing operation with an RCU read lock in order to avoid calling
> queue_delayed_work() after cancel_delayed_work() has completed.

Didn't we introduce HCI_CMD_DRAIN_WORKQUEUE exactly to avoid queuing
after the cancel pattern? I wonder if it wouldn't be better to introduce
some function that disables/enables the workqueue so we don't have to
do extra tracking in the driver/subsystem?

Luiz Augusto von Dentz

Sep 2, 2022, 5:31:24 PM
to Tetsuo Handa, Marcel Holtmann, Johan Hedberg, Schspa Shi, syzbot, syzkall...@googlegroups.com, Lai Jiangshan, Tejun Heo, linux-b...@vger.kernel.org
Hi Tetsuo,
I was thinking of doing something like the following:

https://gist.github.com/Vudentz/a2288015fedbed366fcdb612264a9d16

Since there is no reason to queue any command while we are draining, and
we are going to reset at the end anyway, it is pretty useless to queue
commands at that point.

Tetsuo Handa

Sep 3, 2022, 2:49:43 AM
to Luiz Augusto von Dentz, Marcel Holtmann, Johan Hedberg, Schspa Shi, syzbot, syzkall...@googlegroups.com, Lai Jiangshan, Tejun Heo, linux-b...@vger.kernel.org
On 2022/09/03 6:31, Luiz Augusto von Dentz wrote:
> Hi Tetsuo,
>
> On Fri, Sep 2, 2022 at 11:45 AM Luiz Augusto von Dentz <luiz....@gmail.com> wrote:
>> Didn't we introduce HCI_CMD_DRAIN_WORKQUEUE exactly to avoid queuing
>> after the cancel pattern?

HCI_CMD_DRAIN_WORKQUEUE does not help for this case.

What extid=243b7d89777f90f7613b is reporting is

hci_cmd_timeout() {                      hci_dev_do_reset() {
  starts sleeping due to e.g. preemption
                                           hci_dev_set_flag(hdev, HCI_CMD_DRAIN_WORKQUEUE); // Sets HCI_CMD_DRAIN_WORKQUEUE flag
                                           cancel_delayed_work(&hdev->cmd_timer); // does nothing because hci_cmd_timeout() is already running
                                           cancel_delayed_work(&hdev->ncmd_timer);
                                           drain_workqueue(hdev->workqueue) {
                                             sets __WQ_DRAINING flag on hdev->workqueue
                                             starts waiting for completion of all works on hdev->workqueue
  finishes sleeping due to e.g. preemption
  queue_work(hdev->workqueue, &hdev->cmd_work) // <= complains attempt to queue work from system_wq into __WQ_DRAINING hdev->workqueue
}
                                             finishes waiting for completion of all works on hdev->workqueue
                                             clears __WQ_DRAINING flag
                                           }
                                         }

race condition. Notice that cancel_delayed_work() does not wait for
completion of an already-started hci_cmd_timeout() callback.

If you need to wait for completion of an already-started callback,
you need to use the _sync version (e.g. cancel_delayed_work_sync()).
And watch out for locking dependencies when using the _sync version.
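
For illustration, the waiting variant of the cancel side would look like the
sketch below (not what my patch does; whether hdev's locks permit _sync here
would need auditing first):

	hci_dev_set_flag(hdev, HCI_CMD_DRAIN_WORKQUEUE);
	/* Unlike cancel_delayed_work(), the _sync variant also waits for an
	 * already-running hci_cmd_timeout()/hci_ncmd_timeout() to return,
	 * closing the window shown above. */
	cancel_delayed_work_sync(&hdev->cmd_timer);
	cancel_delayed_work_sync(&hdev->ncmd_timer);
	drain_workqueue(hdev->workqueue);
	/* Deadlocks if called while holding a lock that the callbacks take. */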

>> I wonder if it wouldn't be better to introduce
>> some function that disables/enables the workqueue so we don't have to
>> do extra tracking in the driver/subsystem?
>>
>
> I was thinking on doing something like the following:
>
> https://gist.github.com/Vudentz/a2288015fedbed366fcdb612264a9d16

That patch does not close the race, for

@@ -4037,6 +4038,10 @@ static void hci_cmd_work(struct work_struct *work)
 	BT_DBG("%s cmd_cnt %d cmd queued %d", hdev->name,
 	       atomic_read(&hdev->cmd_cnt), skb_queue_len(&hdev->cmd_q));
 
+	/* Don't queue while draining */
+	if (hci_dev_test_flag(hdev, HCI_CMD_DRAIN_WORKQUEUE))
+		return;
/*
 * BUG: WE ARE FREE TO SLEEP FOR ARBITRARY DURATION IMMEDIATELY AFTER CHECKING THE FLAG.
 * ANY "TEST AND DO SOMETHING" NEEDS TO BE PROTECTED BY A LOCK MECHANISM.
 */
+
 	/* Send queued commands */
 	if (atomic_read(&hdev->cmd_cnt)) {
 		skb = skb_dequeue(&hdev->cmd_q);

. In other words, HCI_CMD_DRAIN_WORKQUEUE does not fix what extid=63bed493aebbf6872647 is reporting.

If "TEST AND DO SOMETHING" does not sleep, RCU is a handy lock mechanism.

>
> Since there is no reason to queue any command if we are draining and
> are gonna reset at the end it is pretty useless to queue commands at
> that point.

Then, you can add that check.

Luiz Augusto von Dentz

Sep 3, 2022, 10:11:31 PM
to Tetsuo Handa, Marcel Holtmann, Johan Hedberg, Schspa Shi, syzbot, syzkall...@googlegroups.com, Lai Jiangshan, Tejun Heo, linux-b...@vger.kernel.org
Hi Tetsuo,

On Fri, Sep 2, 2022 at 11:49 PM Tetsuo Handa
<penguin...@i-love.sakura.ne.jp> wrote:
>
> On 2022/09/03 6:31, Luiz Augusto von Dentz wrote:
> > Hi Tetsuo,
> >
> > On Fri, Sep 2, 2022 at 11:45 AM Luiz Augusto von Dentz <luiz....@gmail.com> wrote:
> >> Didn't we introduce HCI_CMD_DRAIN_WORKQUEUE exactly to avoid queuing
> >> after the cancel pattern?
>
> HCI_CMD_DRAIN_WORKQUEUE does not help for this case.
>
> What extid=243b7d89777f90f7613b is reporting is
>
> hci_cmd_timeout() {                      hci_dev_do_reset() {
>   starts sleeping due to e.g. preemption
>                                            hci_dev_set_flag(hdev, HCI_CMD_DRAIN_WORKQUEUE); // Sets HCI_CMD_DRAIN_WORKQUEUE flag
>                                            cancel_delayed_work(&hdev->cmd_timer); // does nothing because hci_cmd_timeout() is already running
>                                            cancel_delayed_work(&hdev->ncmd_timer);
>                                            drain_workqueue(hdev->workqueue) {
>                                              sets __WQ_DRAINING flag on hdev->workqueue
>                                              starts waiting for completion of all works on hdev->workqueue
>   finishes sleeping due to e.g. preemption
>   queue_work(hdev->workqueue, &hdev->cmd_work) // <= complains attempt to queue work from system_wq into __WQ_DRAINING hdev->workqueue

And we can check for __WQ_DRAINING? Anyway checking
HCI_CMD_DRAIN_WORKQUEUE seems useless, so we need some way to check
whether queue_work() can be used or not.

> }
>                                              finishes waiting for completion of all works on hdev->workqueue
>                                              clears __WQ_DRAINING flag
>                                            }
>                                          }
>
> race condition. Notice that cancel_delayed_work() does not wait for
> completion of an already-started hci_cmd_timeout() callback.
>
> If you need to wait for completion of an already-started callback,
> you need to use the _sync version (e.g. cancel_delayed_work_sync()).
> And watch out for locking dependencies when using the _sync version.
>
> >> I wonder if it wouldn't be better to introduce
> >> some function that disables/enables the workqueue so we don't have to
> >> do extra tracking in the driver/subsystem?
> >>
> >
> > I was thinking on doing something like the following:
> >
> > https://gist.github.com/Vudentz/a2288015fedbed366fcdb612264a9d16
>
> That patch does not close the race, for
>
> @@ -4037,6 +4038,10 @@ static void hci_cmd_work(struct work_struct *work)
>  	BT_DBG("%s cmd_cnt %d cmd queued %d", hdev->name,
>  	       atomic_read(&hdev->cmd_cnt), skb_queue_len(&hdev->cmd_q));
>  
> +	/* Don't queue while draining */
> +	if (hci_dev_test_flag(hdev, HCI_CMD_DRAIN_WORKQUEUE))
> +		return;
> /*
>  * BUG: WE ARE FREE TO SLEEP FOR ARBITRARY DURATION IMMEDIATELY AFTER CHECKING THE FLAG.
>  * ANY "TEST AND DO SOMETHING" NEEDS TO BE PROTECTED BY A LOCK MECHANISM.
>  */

Then we need a lock, not a flag.

> /* Send queued commands */
> if (atomic_read(&hdev->cmd_cnt)) {
> skb = skb_dequeue(&hdev->cmd_q);
>
> . In other words, HCI_CMD_DRAIN_WORKQUEUE does not fix what extid=63bed493aebbf6872647 is reporting.
>
> If "TEST AND DO SOMETHING" does not sleep, RCU is a handy lock mechanism.
>
> >
> > Since there is no reason to queue any command while we are draining, and
> > we are going to reset at the end anyway, it is pretty useless to queue
> > commands at that point.
>
> Then, you can add that check.
>


Tejun Heo

Sep 3, 2022, 10:21:02 PM
to Luiz Augusto von Dentz, Tetsuo Handa, Marcel Holtmann, Johan Hedberg, Schspa Shi, syzbot, syzkall...@googlegroups.com, Lai Jiangshan, linux-b...@vger.kernel.org
Hello,

On Sat, Sep 03, 2022 at 07:11:18PM -0700, Luiz Augusto von Dentz wrote:
> And we can check for __WQ_DRAINING? Anyway checking

Please don't do that. That's an internal flag. It shouldn't be *that*
difficult to avoid this without peeking into wq internal state.

Thanks.

--
tejun

Schspa Shi

Sep 5, 2022, 4:31:25 AM
to Tejun Heo, Luiz Augusto von Dentz, Tetsuo Handa, Marcel Holtmann, Johan Hedberg, syzbot, syzkall...@googlegroups.com, Lai Jiangshan, linux-b...@vger.kernel.org
It seems we only need to change hdev->{cmd,ncmd}_timer to
hdev->workqueue; there will be no race because drain_workqueue() will
flush all pending work internally.
Any new timeout work will see the HCI_CMD_DRAIN_WORKQUEUE flag after we
have cancelled and flushed all the delayed work.

--
BRs
Schspa Shi

Tetsuo Handa

Sep 5, 2022, 7:23:51 AM
to Schspa Shi, Luiz Augusto von Dentz, Marcel Holtmann, Johan Hedberg, syzbot, syzkall...@googlegroups.com, Lai Jiangshan, linux-b...@vger.kernel.org, Tejun Heo
On 2022/09/05 17:24, Schspa Shi wrote:
>
> Tejun Heo <t...@kernel.org> writes:
>
>> Hello,
>>
>> On Sat, Sep 03, 2022 at 07:11:18PM -0700, Luiz Augusto von Dentz wrote:
>>> And we can check for __WQ_DRAINING? Anyway checking
>>
>> Please don't do that. That's an internal flag. It shouldn't be *that*
>> difficult to avoid this without peeking into wq internal state.
>>
>> Thanks.
>
> It seems we only need to change hdev->{cmd,ncmd}_timer to
> hdev->workqueue; there will be no race because drain_workqueue() will
> flush all pending work internally.

True for queue_work(), not always true for queue_delayed_work(). Explained below.

> Any new timeout work will see the HCI_CMD_DRAIN_WORKQUEUE flag after we
> have cancelled and flushed all the delayed work.
>

If you don't mind

  queue_work(&hdev->cmd_work) followed by hci_cmd_work() (case A below)

and/or

  queue_delayed_work(&hdev->ncmd_timer) potentially followed by hci_ncmd_timeout()/hci_reset_dev() (cases B and C below)

being called after the HCI_CMD_DRAIN_WORKQUEUE flag has been observed, that is fine.
We need to use RCU protection if you do mind one of these.



Case A:

hci_dev_do_reset() {                     hci_cmd_work() {
                                           if (test_bit(HCI_RESET, &hdev->flags) ||
                                               hci_dev_test_flag(hdev, HCI_CMD_DRAIN_WORKQUEUE))
                                             cancel_delayed_work(&hdev->cmd_timer);
                                           else
                                             queue_delayed_work(hdev->workqueue, &hdev->cmd_timer, HCI_CMD_TIMEOUT);
                                           } else {
                                             skb_queue_head(&hdev->cmd_q, skb);
  hci_dev_set_flag(hdev, HCI_CMD_DRAIN_WORKQUEUE);
  cancel_delayed_work(&hdev->cmd_timer);
  cancel_delayed_work(&hdev->ncmd_timer);
                                             queue_work(hdev->workqueue, &hdev->cmd_work); // Queuing after setting HCI_CMD_DRAIN_WORKQUEUE, despite the intent of that flag...
  drain_workqueue(hdev->workqueue); // Will wait for hci_cmd_work() queued by queue_work() to complete.
                                         }

  // Actual flush() happens here.

  hci_dev_clear_flag(hdev, HCI_CMD_DRAIN_WORKQUEUE);
}



Case B:

hci_dev_do_reset() {                     handle_cmd_cnt_and_timer(struct hci_dev *hdev, u8 ncmd) {
                                           if (!hci_dev_test_flag(hdev, HCI_CMD_DRAIN_WORKQUEUE))
  hci_dev_set_flag(hdev, HCI_CMD_DRAIN_WORKQUEUE);
  cancel_delayed_work(&hdev->cmd_timer);
                                             queue_delayed_work(hdev->workqueue, &hdev->ncmd_timer, HCI_NCMD_TIMEOUT);
  cancel_delayed_work(&hdev->ncmd_timer); // May or may not cancel hci_ncmd_timeout() queued by queue_delayed_work().
  drain_workqueue(hdev->workqueue); // Will wait for hci_ncmd_timeout() queued by queue_delayed_work() to complete if cancel_delayed_work() failed to cancel.
                                         }

  // Actual flush() happens here.

  hci_dev_clear_flag(hdev, HCI_CMD_DRAIN_WORKQUEUE);
}



Case C:

hci_dev_do_reset() {                     handle_cmd_cnt_and_timer(struct hci_dev *hdev, u8 ncmd) {
                                           if (!hci_dev_test_flag(hdev, HCI_CMD_DRAIN_WORKQUEUE))
  hci_dev_set_flag(hdev, HCI_CMD_DRAIN_WORKQUEUE);
  cancel_delayed_work(&hdev->cmd_timer);
  cancel_delayed_work(&hdev->ncmd_timer); // Does nothing.
                                             queue_delayed_work(hdev->workqueue, &hdev->ncmd_timer, HCI_NCMD_TIMEOUT);
  drain_workqueue(hdev->workqueue); // Will wait for hci_ncmd_timeout() queued by queue_delayed_work() to complete if delay timer has expired.
                                         }

  // Actual flush() happens here, but hci_ncmd_timeout() queued by queue_delayed_work() can be running if delay timer has not expired as of calling drain_workqueue().

  hci_dev_clear_flag(hdev, HCI_CMD_DRAIN_WORKQUEUE);
}

Schspa Shi

Sep 5, 2022, 8:32:18 AM
to Tetsuo Handa, Luiz Augusto von Dentz, Marcel Holtmann, Johan Hedberg, syzbot, syzkall...@googlegroups.com, Lai Jiangshan, linux-b...@vger.kernel.org, Tejun Heo

Tetsuo Handa <penguin...@I-love.SAKURA.ne.jp> writes:

> On 2022/09/05 17:24, Schspa Shi wrote:
>>
>> Tejun Heo <t...@kernel.org> writes:
>>
>>> Hello,
>>>
>>> On Sat, Sep 03, 2022 at 07:11:18PM -0700, Luiz Augusto von Dentz wrote:
>>>> And we can check for __WQ_DRAINING? Anyway checking
>>>
>>> Please don't do that. That's an internal flag. It shouldn't be *that*
>>> difficult to avoid this without peeking into wq internal state.
>>>
>>> Thanks.
>>
>> It seems we only need to change hdev->{cmd,ncmd}_timer to
>> hdev->workqueue; there will be no race because drain_workqueue() will
>> flush all pending work internally.
>
> True for queue_work(), not always true for queue_delayed_work(). Explained below.
>

Ok, you are right, got it now.
--
BRs
Schspa Shi

patchwork-b...@kernel.org

Sep 19, 2022, 1:30:19 PM
to Tetsuo Handa, mar...@holtmann.org, johan....@gmail.com, luiz....@gmail.com, sch...@gmail.com, syzbot+243b7d...@syzkaller.appspotmail.com, syzkall...@googlegroups.com, jiangs...@gmail.com, t...@kernel.org, linux-b...@vger.kernel.org
Hello:

This patch was applied to bluetooth/bluetooth-next.git (master)
by Luiz Augusto von Dentz <luiz.vo...@intel.com>:

On Fri, 2 Sep 2022 20:23:48 +0900 you wrote:
> syzbot is reporting an attempt to schedule hdev->cmd_work work from system_wq
> WQ into hdev->workqueue WQ, which is under a draining operation [1], for
> commit c8efcc2589464ac7 ("workqueue: allow chained queueing during
> destruction") does not allow such an operation.
>
> The check introduced by commit 877afadad2dce8aa ("Bluetooth: When HCI work
> queue is drained, only queue chained work") was incomplete.
>
> [...]

Here is the summary with links:
- Bluetooth: use hdev->workqueue when queuing hdev->{cmd,ncmd}_timer works
https://git.kernel.org/bluetooth/bluetooth-next/c/deee93d13d38

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html

