general protection fault in wb_workfn (2)


syzbot

May 26, 2018, 5:15:03 AM
to linux-...@vger.kernel.org, linux-...@vger.kernel.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk
Hello,

syzbot found the following crash on:

HEAD commit: 305bb5521282 Merge tag 'selinux-pr-20180516' of git://git...
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=153eb40f800000
kernel config: https://syzkaller.appspot.com/x/.config?x=f3b4e30da84ec1ed
dashboard link: https://syzkaller.appspot.com/bug?extid=4a7438e774b21ddd8eca
compiler: gcc (GCC) 8.0.1 20180413 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+4a7438...@syzkaller.appspotmail.com

binder: 13169:13171 ioctl 40047459 20000000 returned -22
sock: process `syz-executor6' is using obsolete setsockopt SO_BSDCOMPAT
binder: 13169:13202 Acquire 1 refcount change on invalid ref 0 ret -22
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 88 Comm: kworker/u4:3 Not tainted 4.17.0-rc5+ #55
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Workqueue: writeback wb_workfn
RIP: 0010:dev_name include/linux/device.h:1008 [inline]
RIP: 0010:wb_workfn+0x195/0x1740 fs/fs-writeback.c:1937
RSP: 0018:ffff8801d964f270 EFLAGS: 00010206
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff814e0f15
RDX: 000000000000000a RSI: ffffffff81cd221d RDI: 0000000000000050
RBP: ffff8801d964f750 R08: ffff8801d97c6700 R09: ffffed003b5e46c2
R10: ffffed003b5e46c2 R11: ffff8801daf23613 R12: 0000000000000001
R13: 1ffff1003b2c9f37 R14: ffff8801d964f728 R15: ffff8801d6836f18
FS: 0000000000000000(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fec3a840db8 CR3: 00000001b49ae000 CR4: 00000000001426e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145
process_scheduled_works kernel/workqueue.c:2205 [inline]
worker_thread+0xa30/0x1440 kernel/workqueue.c:2284
kthread+0x345/0x410 kernel/kthread.c:240
ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412
Code: fa 48 c1 ea 03 80 3c 02 00 0f 85 ee 13 00 00 48 8b 9b 08 06 00 00 48
b8 00 00 00 00 00 fc ff df 48 8d 7b 50 48 89 fa 48 c1 ea 03 <80> 3c 02 00
0f 85 3f 14 00 00 4c 8b 63 50 4d 85 e4 0f 84 a9 0e
RIP: dev_name include/linux/device.h:1008 [inline] RSP: ffff8801d964f270
RIP: wb_workfn+0x195/0x1740 fs/fs-writeback.c:1937 RSP: ffff8801d964f270
---[ end trace baf4ced88bb756b8 ]---


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzk...@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with
syzbot.

Tetsuo Handa

May 26, 2018, 8:49:33 PM
to syzbot, syzkall...@googlegroups.com, ja...@suse.cz, linux-...@vger.kernel.org, linux-...@vger.kernel.org, vi...@zeniv.linux.org.uk, ax...@kernel.dk, t...@kernel.org, da...@fromorbit.com, linux...@vger.kernel.org
Forwarding http://lkml.kernel.org/r/201805251915.FGH64...@I-love.SAKURA.ne.jp .

Jan Kara wrote:
> > void delayed_work_timer_fn(struct timer_list *t)
> > {
> > struct delayed_work *dwork = from_timer(dwork, t, timer);
> >
> > /* should have been called from irqsafe timer with irq already off */
> > __queue_work(dwork->cpu, dwork->wq, &dwork->work);
> > }
> >
> > Then, wb_workfn() is after all scheduled even if we check for
> > WB_registered bit, isn't it?
>
> It can be queued after WB_registered bit is cleared but it cannot be queued
> after mod_delayed_work(bdi_wq, &wb->dwork, 0) has finished. That function
> deletes the pending timer (the timer cannot be armed again because
> WB_registered is cleared) and queues what should be the last round of
> wb_workfn().

mod_delayed_work() deletes the pending timer but does not wait for an
already-invoked timer handler to complete, because it uses del_timer() rather
than del_timer_sync(). Then what happens if __queue_work() is executed almost
concurrently from two CPUs, one from mod_delayed_work(bdi_wq, &wb->dwork, 0) in
the wb_shutdown() path (which runs without holding spin_lock_bh(&wb->work_lock))
and the other from the delayed_work_timer_fn() path (which runs without checking
the WB_registered bit under spin_lock_bh(&wb->work_lock))?

wb_wakeup_delayed() {
  spin_lock_bh(&wb->work_lock);
  if (test_bit(WB_registered, &wb->state)) // succeeds
    queue_delayed_work(bdi_wq, &wb->dwork, timeout) {
      queue_delayed_work_on(WORK_CPU_UNBOUND, bdi_wq, &wb->dwork, timeout) {
        if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&wb->dwork.work))) { // succeeds
          __queue_delayed_work(WORK_CPU_UNBOUND, bdi_wq, &wb->dwork, timeout) {
            add_timer(timer); // schedules delayed_work_timer_fn()
          }
        }
      }
    }
  spin_unlock_bh(&wb->work_lock);
}

delayed_work_timer_fn() {
  // del_timer() already returns false at this point because this timer
  // is already inside its handler. But did something here take long enough
  // to wait for __queue_work() from the wb_shutdown() path to finish?
  __queue_work(WORK_CPU_UNBOUND, bdi_wq, &wb->dwork.work) {
    insert_work(pwq, work, worklist, work_flags);
  }
}

wb_shutdown() {
  mod_delayed_work(bdi_wq, &wb->dwork, 0) {
    mod_delayed_work_on(WORK_CPU_UNBOUND, bdi_wq, &wb->dwork, 0) {
      ret = try_to_grab_pending(&wb->dwork.work, true, &flags) {
        if (likely(del_timer(&wb->dwork.timer))) // fails because already in delayed_work_timer_fn()
          return 1;
        if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&wb->dwork.work))) // fails because already set by queue_delayed_work()
          return 0;
        // Returns 1 or -ENOENT after doing something?
      }
      if (ret >= 0)
        __queue_delayed_work(WORK_CPU_UNBOUND, bdi_wq, &wb->dwork, 0) {
          __queue_work(WORK_CPU_UNBOUND, bdi_wq, &wb->dwork.work) {
            insert_work(pwq, work, worklist, work_flags);
          }
        }
    }
  }
}

Tetsuo Handa

May 26, 2018, 10:22:40 PM
to syzbot, syzkall...@googlegroups.com, ja...@suse.cz, linux-...@vger.kernel.org, linux-...@vger.kernel.org, vi...@zeniv.linux.org.uk, ax...@kernel.dk, t...@kernel.org, da...@fromorbit.com, linux...@vger.kernel.org
From 8a8222698163d1fe180258566e9a3ff43f54fcd9 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin...@I-love.SAKURA.ne.jp>
Date: Sun, 27 May 2018 11:08:20 +0900
Subject: [PATCH] bdi: Fix another oops in wb_workfn()

syzbot is still hitting a NULL pointer dereference in wb_workfn() [1].
This might be because we overlooked that delayed_work_timer_fn() does not
check WB_registered before calling __queue_work(), and that mod_delayed_work()
does not wait for an already-started delayed_work_timer_fn() because it uses
del_timer() rather than del_timer_sync().

Make wb_shutdown() careful about the wb_wakeup_delayed() path.

[1] https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206

Signed-off-by: Tetsuo Handa <penguin...@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+4a7438...@syzkaller.appspotmail.com>
Cc: Tejun Heo <t...@kernel.org>
Cc: Dave Chinner <dchi...@redhat.com>
Cc: Jan Kara <ja...@suse.cz>
Cc: Jens Axboe <ax...@kernel.dk>
---
mm/backing-dev.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 7441bd9..31e1d7e 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -372,11 +372,24 @@ static void wb_shutdown(struct bdi_writeback *wb)

 	cgwb_remove_from_bdi_list(wb);
 	/*
+	 * mod_delayed_work() is not appropriate here, for
+	 * delayed_work_timer_fn() from wb_wakeup_delayed() does not check
+	 * WB_registered before calling __queue_work().
+	 */
+	del_timer_sync(&wb->dwork.timer);
+	/*
+	 * Clear WORK_STRUCT_PENDING_BIT in order to make sure that next
+	 * queue_delayed_work() actually enqueues this work to the tail, for
+	 * wb_wakeup_delayed() already set WORK_STRUCT_PENDING_BIT before
+	 * scheduling delayed_work_timer_fn().
+	 */
+	cancel_delayed_work_sync(&wb->dwork);
+	/*
 	 * Drain work list and shutdown the delayed_work. !WB_registered
 	 * tells wb_workfn() that @wb is dying and its work_list needs to
 	 * be drained no matter what.
 	 */
-	mod_delayed_work(bdi_wq, &wb->dwork, 0);
+	queue_delayed_work(bdi_wq, &wb->dwork, 0);
 	flush_delayed_work(&wb->dwork);
 	WARN_ON(!list_empty(&wb->work_list));
 	/*
--
1.8.3.1


Tejun Heo

May 26, 2018, 10:36:06 PM
to Tetsuo Handa, syzbot, syzkall...@googlegroups.com, ja...@suse.cz, linux-...@vger.kernel.org, linux-...@vger.kernel.org, vi...@zeniv.linux.org.uk, ax...@kernel.dk, da...@fromorbit.com, linux...@vger.kernel.org
On Sun, May 27, 2018 at 11:21:25AM +0900, Tetsuo Handa wrote:
> From 8a8222698163d1fe180258566e9a3ff43f54fcd9 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin...@I-love.SAKURA.ne.jp>
> Date: Sun, 27 May 2018 11:08:20 +0900
> Subject: [PATCH] bdi: Fix another oops in wb_workfn()
>
> syzbot is still hitting NULL pointer dereference at wb_workfn() [1].
> This might be because we overlooked that delayed_work_timer_fn() does not
> check WB_registered before calling __queue_work() while mod_delayed_work()
> does not wait for already started delayed_work_timer_fn() because it uses
> del_timer() rather than del_timer_sync().

It shouldn't be that as dwork timer is an irq safe timer. Even if
that's the case, the right thing to do would be fixing workqueue
rather than reaching into workqueue internals from backing-dev code.

Thanks.

--
tejun

Tetsuo Handa

May 27, 2018, 12:45:05 AM
to t...@kernel.org, syzbot+4a7438...@syzkaller.appspotmail.com, syzkall...@googlegroups.com, ja...@suse.cz, vi...@zeniv.linux.org.uk, ax...@kernel.dk, da...@fromorbit.com, linux...@vger.kernel.org
Do you think that there is a possibility that __queue_work() is executed
almost concurrently from two CPUs?

Jan Kara

May 28, 2018, 12:13:49 PM
to Tetsuo Handa, syzbot, syzkall...@googlegroups.com, ja...@suse.cz, linux-...@vger.kernel.org, linux-...@vger.kernel.org, vi...@zeniv.linux.org.uk, ax...@kernel.dk, t...@kernel.org, da...@fromorbit.com, linux...@vger.kernel.org
On Sun 27-05-18 09:47:54, Tetsuo Handa wrote:
> Forwarding http://lkml.kernel.org/r/201805251915.FGH64...@I-love.SAKURA.ne.jp .
>
> Jan Kara wrote:
> > > void delayed_work_timer_fn(struct timer_list *t)
> > > {
> > > struct delayed_work *dwork = from_timer(dwork, t, timer);
> > >
> > > /* should have been called from irqsafe timer with irq already off */
> > > __queue_work(dwork->cpu, dwork->wq, &dwork->work);
> > > }
> > >
> > > Then, wb_workfn() is after all scheduled even if we check for
> > > WB_registered bit, isn't it?
> >
> > It can be queued after WB_registered bit is cleared but it cannot be queued
> > after mod_delayed_work(bdi_wq, &wb->dwork, 0) has finished. That function
> > deletes the pending timer (the timer cannot be armed again because
> > WB_registered is cleared) and queues what should be the last round of
> > wb_workfn().
>
> mod_delayed_work() deletes the pending timer but does not wait for already
> invoked timer handler to complete because it is using del_timer() rather than
> del_timer_sync(). Then, what happens if __queue_work() is almost concurrently
> executed from two CPUs, one from mod_delayed_work(bdi_wq, &wb->dwork, 0) from
> wb_shutdown() path (which is called without spin_lock_bh(&wb->work_lock)) and
> the other from delayed_work_timer_fn() path (which is called without checking
> WB_registered bit under spin_lock_bh(&wb->work_lock)) ?

In this case, work should still be queued only once. The synchronization in
this case should be provided by the WORK_STRUCT_PENDING_BIT. When a delayed
work is queued by mod_delayed_work(), this bit is set, and gets cleared
only once the work is started on some CPU. But admittedly this code is
rather convoluted so I may be missing something.

Also you should note that flush_delayed_work() which follows
mod_delayed_work() in wb_shutdown() does del_timer_sync() so I don't see
how anything could get past that. In fact mod_delayed_work() is in
wb_shutdown() path to make sure wb_workfn() gets executed at least once
before the bdi_writeback structure gets cleaned up so that all queued items
are finished. We do not rely on it to remove pending timers or queued
wb_workfn() executions.

Honza
--
Jan Kara <ja...@suse.com>
SUSE Labs, CR

Tejun Heo

May 29, 2018, 9:46:18 AM
to Tetsuo Handa, syzbot+4a7438...@syzkaller.appspotmail.com, syzkall...@googlegroups.com, ja...@suse.cz, vi...@zeniv.linux.org.uk, ax...@kernel.dk, da...@fromorbit.com, linux...@vger.kernel.org
__queue_work() is gated by WORK_STRUCT_PENDING_BIT, so I don't see how
multiple instances would execute concurrently for the same work item.

Thanks.

--
tejun

Tetsuo Handa

May 30, 2018, 12:00:59 PM
to Jan Kara, syzbot, syzkall...@googlegroups.com, linux-...@vger.kernel.org, linux-...@vger.kernel.org, vi...@zeniv.linux.org.uk, ax...@kernel.dk, t...@kernel.org, da...@fromorbit.com, linux...@vger.kernel.org
So, we have no idea what is happening...
Then, what about starting with the temporary debug printk() patch shown below?

From 4f70f72ad3c9ae6ce1678024ef740aca4958e5b0 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin...@I-love.SAKURA.ne.jp>
Date: Wed, 30 May 2018 09:57:10 +0900
Subject: [PATCH] bdi: Add temporary config for debugging wb_workfn() versus
bdi_unregister() race bug.

syzbot is hitting a NULL pointer dereference in wb_workfn() [1]. But given the
limitations that syzbot cannot find a reproducer for this bug (it fires only
once or twice per day) and that we cannot capture a vmcore in the environment
syzbot is using, for now we need to rely on printk() debugging.

[1] https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206

Signed-off-by: Tetsuo Handa <penguin...@I-love.SAKURA.ne.jp>
---
block/Kconfig | 7 +++++++
fs/fs-writeback.c | 31 +++++++++++++++++++++++++++++++
2 files changed, 38 insertions(+)

diff --git a/block/Kconfig b/block/Kconfig
index 28ec557..fbce13e 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -139,6 +139,13 @@ config BLK_CMDLINE_PARSER

 	  See Documentation/block/cmdline-partition.txt for more information.
 
+config BLK_DEBUG_WB_WORKFN_RACE
+	bool "Dump upon hitting wb_workfn() versus bdi_unregister() race bug."
+	default n
+	---help---
+	  This is a temporary option used for obtaining information about a
+	  specific bug. It will be removed after the bug is fixed.
+
 config BLK_WBT
 	bool "Enable support for block device writeback throttling"
 	default n
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 471d863..b4dd078 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1934,6 +1934,37 @@ void wb_workfn(struct work_struct *work)
 						struct bdi_writeback, dwork);
 	long pages_written;
 
+#ifdef CONFIG_BLK_DEBUG_WB_WORKFN_RACE
+	if (!wb->bdi->dev) {
+		pr_warn("WARNING: %s: device is NULL\n", __func__);
+		pr_warn("wb->state=%lx\n", wb->state);
+		pr_warn("list_empty(&wb->work_list)=%u\n",
+			list_empty(&wb->work_list));
+		if (!wb->bdi)
+			pr_warn("wb->bdi == NULL\n");
+		else {
+			pr_warn("list_empty(&wb->bdi->bdi_list)=%u\n",
+				list_empty(&wb->bdi->bdi_list));
+			pr_warn("wb->bdi->wb.state=%lx\n", wb->bdi->wb.state);
+		}
+		if (!wb->congested)
+			pr_warn("wb->congested == NULL\n");
+#ifdef CONFIG_CGROUP_WRITEBACK
+		else if (!wb->congested->__bdi)
+			pr_warn("wb->congested->__bdi == NULL\n");
+		else {
+			pr_warn("(wb->congested->__bdi == wb->bdi)=%u\n",
+				wb->congested->__bdi == wb->bdi);
+			pr_warn("list_empty(&wb->congested->__bdi->bdi_list)=%u\n",
+				list_empty(&wb->congested->__bdi->bdi_list));
+			pr_warn("wb->congested->__bdi->wb.state=%lx\n",
+				wb->congested->__bdi->wb.state);
+		}
+#endif
+		/* Will halt shortly due to NULL pointer dereference... */
+	}
+#endif
+
 	set_worker_desc("flush-%s", dev_name(wb->bdi->dev));
 	current->flags |= PF_SWAPWRITE;

--
1.8.3.1

Jan Kara

May 31, 2018, 7:42:30 AM
to Tetsuo Handa, Jan Kara, syzbot, syzkall...@googlegroups.com, linux-...@vger.kernel.org, linux-...@vger.kernel.org, vi...@zeniv.linux.org.uk, ax...@kernel.dk, t...@kernel.org, da...@fromorbit.com, linux...@vger.kernel.org
On Thu 31-05-18 01:00:08, Tetsuo Handa wrote:
> So, we have no idea what is happening...
> Then, what about starting from temporary debug printk() patch shown below?
>
> >From 4f70f72ad3c9ae6ce1678024ef740aca4958e5b0 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin...@I-love.SAKURA.ne.jp>
> Date: Wed, 30 May 2018 09:57:10 +0900
> Subject: [PATCH] bdi: Add temporary config for debugging wb_workfn() versus
> bdi_unregister() race bug.
>
> syzbot is hitting NULL pointer dereference at wb_workfn() [1]. But due to
> limitations that syzbot cannot find reproducer for this bug (frequency is
> once or twice per a day) nor we can't capture vmcore in the environment
> which syzbot is using, for now we need to rely on printk() debugging.
>
> [1] https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206
>
> Signed-off-by: Tetsuo Handa <penguin...@I-love.SAKURA.ne.jp>

Hmm, a bit of an ugly solution, but if others are fine with this, I can live
with it for a while as well. Or would it be possible for syzkaller to just test
some git tree where this patch is included? Then we would not even have to
have the extra config option...

> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 471d863..b4dd078 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1934,6 +1934,37 @@ void wb_workfn(struct work_struct *work)
> struct bdi_writeback, dwork);
> long pages_written;
>
> +#ifdef CONFIG_BLK_DEBUG_WB_WORKFN_RACE
> + if (!wb->bdi->dev) {
> + pr_warn("WARNING: %s: device is NULL\n", __func__);
> + pr_warn("wb->state=%lx\n", wb->state);
> + pr_warn("list_empty(&wb->work_list)=%u\n",
> + list_empty(&wb->work_list));
> + if (!wb->bdi)

This is not possible when we dereference wb->bdi above...

> + pr_warn("wb->bdi == NULL\n");
> + else {
> + pr_warn("list_empty(&wb->bdi->bdi_list)=%u\n",
> + list_empty(&wb->bdi->bdi_list));
> + pr_warn("wb->bdi->wb.state=%lx\n", wb->bdi->wb.state);
> + }

It would also be good to print whether wb == &wb->bdi->wb (i.e. whether it is
the default writeback structure or one for some cgroup), and also
wb->bdi->wb.state.

Honza

> + if (!wb->congested)
> + pr_warn("wb->congested == NULL\n");
> +#ifdef CONFIG_CGROUP_WRITEBACK
> + else if (!wb->congested->__bdi)
> + pr_warn("wb->congested->__bdi == NULL\n");
> + else {
> + pr_warn("(wb->congested->__bdi == wb->bdi)=%u\n",
> + wb->congested->__bdi == wb->bdi);
> + pr_warn("list_empty(&wb->congested->__bdi->bdi_list)=%u\n",
> + list_empty(&wb->congested->__bdi->bdi_list));
> + pr_warn("wb->congested->__bdi->wb.state=%lx\n",
> + wb->congested->__bdi->wb.state);
> + }
> +#endif
> + /* Will halt shortly due to NULL pointer dereference... */
> + }
> +#endif
> +
> set_worker_desc("flush-%s", dev_name(wb->bdi->dev));
> current->flags |= PF_SWAPWRITE;
>
> --
> 1.8.3.1
>

Tetsuo Handa

May 31, 2018, 9:20:37 AM
to Jan Kara, syzbot, syzkall...@googlegroups.com, linux-...@vger.kernel.org, linux-...@vger.kernel.org, vi...@zeniv.linux.org.uk, ax...@kernel.dk, t...@kernel.org, da...@fromorbit.com, linux...@vger.kernel.org, Dmitry Vyukov
On 2018/05/31 20:42, Jan Kara wrote:
> On Thu 31-05-18 01:00:08, Tetsuo Handa wrote:
>> So, we have no idea what is happening...
>> Then, what about starting from temporary debug printk() patch shown below?
>>
>> >From 4f70f72ad3c9ae6ce1678024ef740aca4958e5b0 Mon Sep 17 00:00:00 2001
>> From: Tetsuo Handa <penguin...@I-love.SAKURA.ne.jp>
>> Date: Wed, 30 May 2018 09:57:10 +0900
>> Subject: [PATCH] bdi: Add temporary config for debugging wb_workfn() versus
>> bdi_unregister() race bug.
>>
>> syzbot is hitting NULL pointer dereference at wb_workfn() [1]. But due to
>> limitations that syzbot cannot find reproducer for this bug (frequency is
>> once or twice per a day) nor we can't capture vmcore in the environment
>> which syzbot is using, for now we need to rely on printk() debugging.
>>
>> [1] https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206
>>
>> Signed-off-by: Tetsuo Handa <penguin...@I-love.SAKURA.ne.jp>
>
> Hum a bit ugly solution but if others are fine with this, I can live with
> it for a while as well. Or would it be possible for syzkaller to just test
> some git tree where this patch is included? Then we would not even have to
> have the extra config option...

Only if syzbot can reproduce this bug that way. While it is possible to
add/remove the git trees syzbot tests, frequently adding/removing trees is
bothersome.

syzbot can enable an extra config option. Maybe the config name should be
something generic like CONFIG_DEBUG_FOR_SYZBOT rather than one per individual
topic.

I think syzbot is using many VM instances. I don't know how many instances
would be needed to reproduce this bug within a reasonable period. The more git
trees syzbot tests, the longer (I assume) it will take to reproduce this bug.
The most reliable way is to use the shared part of all trees (i.e. linux.git).

>
>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>> index 471d863..b4dd078 100644
>> --- a/fs/fs-writeback.c
>> +++ b/fs/fs-writeback.c
>> @@ -1934,6 +1934,37 @@ void wb_workfn(struct work_struct *work)
>> struct bdi_writeback, dwork);
>> long pages_written;
>>
>> +#ifdef CONFIG_BLK_DEBUG_WB_WORKFN_RACE
>> + if (!wb->bdi->dev) {
>> + pr_warn("WARNING: %s: device is NULL\n", __func__);
>> + pr_warn("wb->state=%lx\n", wb->state);
>> + pr_warn("list_empty(&wb->work_list)=%u\n",
>> + list_empty(&wb->work_list));
>> + if (!wb->bdi)
>
> This is not possible when we dereferences wb->bdi above...

Oops. I missed it.

>
>> + pr_warn("wb->bdi == NULL\n");
>> + else {
>> + pr_warn("list_empty(&wb->bdi->bdi_list)=%u\n",
>> + list_empty(&wb->bdi->bdi_list));
>> + pr_warn("wb->bdi->wb.state=%lx\n", wb->bdi->wb.state);
>> + }
>
> It would be also good to print whether wb == wb->bdi->wb (i.e. it is the
> default writeback structure or one for some cgroup) and also
> wb->bdi->wb.state.
>

wb->bdi->wb.state is already printed. Updated patch is shown below.
Anything else to print?



From 3f3346d42b804e59d12caaa525365a8482505f08 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin...@I-love.SAKURA.ne.jp>
Date: Thu, 31 May 2018 22:07:20 +0900
Subject: [PATCH v2] bdi: Add temporary config for debugging wb_workfn() versus
bdi_unregister() race bug.

syzbot is hitting a NULL pointer dereference in wb_workfn() [1]. But given the
limitations that syzbot cannot find a reproducer for this bug (it fires only
once or twice per day) and that we cannot capture a vmcore in the environment
syzbot is using, for now we need to rely on printk() debugging.

[1] https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206

Signed-off-by: Tetsuo Handa <penguin...@I-love.SAKURA.ne.jp>
---
block/Kconfig | 7 +++++++
fs/fs-writeback.c | 28 ++++++++++++++++++++++++++++
2 files changed, 35 insertions(+)

diff --git a/block/Kconfig b/block/Kconfig
index 28ec557..fbce13e 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -139,6 +139,13 @@ config BLK_CMDLINE_PARSER

 	  See Documentation/block/cmdline-partition.txt for more information.
 
+config BLK_DEBUG_WB_WORKFN_RACE
+	bool "Dump upon hitting wb_workfn() versus bdi_unregister() race bug."
+	default n
+	---help---
+	  This is a temporary option used for obtaining information about a
+	  specific bug. It will be removed after the bug is fixed.
+
 config BLK_WBT
 	bool "Enable support for block device writeback throttling"
 	default n
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 471d863..14ab873 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1934,6 +1934,34 @@ void wb_workfn(struct work_struct *work)
 						struct bdi_writeback, dwork);
 	long pages_written;
 
+#ifdef CONFIG_BLK_DEBUG_WB_WORKFN_RACE
+	if (!wb->bdi->dev) {
+		pr_warn("WARNING: %s: device is NULL\n", __func__);
+		pr_warn("wb->state=%lx\n", wb->state);
+		pr_warn("(wb == &wb->bdi->wb)=%u\n", wb == &wb->bdi->wb);
+		pr_warn("list_empty(&wb->work_list)=%u\n",
+			list_empty(&wb->work_list));
+		pr_warn("list_empty(&wb->bdi->bdi_list)=%u\n",
+			list_empty(&wb->bdi->bdi_list));
+		pr_warn("wb->bdi->wb.state=%lx\n", wb->bdi->wb.state);

Jan Kara

May 31, 2018, 9:42:14 AM
to Tetsuo Handa, Jan Kara, syzbot, syzkall...@googlegroups.com, linux-...@vger.kernel.org, linux-...@vger.kernel.org, vi...@zeniv.linux.org.uk, ax...@kernel.dk, t...@kernel.org, da...@fromorbit.com, linux...@vger.kernel.org, Dmitry Vyukov
I understand this; I'd just be a bit reluctant to merge temporary debug
patches like this into Linus' tree only to revert them later just because of
syzkaller... What do others think?

> From 3f3346d42b804e59d12caaa525365a8482505f08 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin...@I-love.SAKURA.ne.jp>
> Date: Thu, 31 May 2018 22:07:20 +0900
> Subject: [PATCH v2] bdi: Add temporary config for debugging wb_workfn() versus
> bdi_unregister() race bug.
>
> syzbot is hitting NULL pointer dereference at wb_workfn() [1]. But due to
> limitations that syzbot cannot find reproducer for this bug (frequency is
> once or twice per a day) nor we can't capture vmcore in the environment
> which syzbot is using, for now we need to rely on printk() debugging.
>
> [1] https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206
>
> Signed-off-by: Tetsuo Handa <penguin...@I-love.SAKURA.ne.jp>

The debug patch looks good to me now. Thanks for writing it.

Honza

Jens Axboe

May 31, 2018, 12:56:16 PM
to Jan Kara, Tetsuo Handa, syzbot, syzkall...@googlegroups.com, linux-...@vger.kernel.org, linux-...@vger.kernel.org, vi...@zeniv.linux.org.uk, t...@kernel.org, da...@fromorbit.com, linux...@vger.kernel.org, Dmitry Vyukov
I guess I don't understand why having it in Linus's tree would make
a difference to syzkaller?

If there is a compelling reason why that absolutely has to be done,
I don't think it should be accompanied by a special Kconfig option
for it. It should just be on unconditionally, with the intent to
remove it before release.

--
Jens Axboe

Dave Chinner

May 31, 2018, 10:30:14 PM
to Jan Kara, Tetsuo Handa, syzbot, syzkall...@googlegroups.com, linux-...@vger.kernel.org, linux-...@vger.kernel.org, vi...@zeniv.linux.org.uk, ax...@kernel.dk, t...@kernel.org, linux...@vger.kernel.org, Dmitry Vyukov
Don't commit temporary debug crap to the mainline kernel. Enable
syzkaller to run against a user supplied git tree specification if
it needs special debug to track down a problem.

-Dave.
--
Dave Chinner
da...@fromorbit.com

Tetsuo Handa

Jun 5, 2018, 9:46:22 AM
to Dmitry Vyukov, Jens Axboe, Jan Kara, syzbot, syzkall...@googlegroups.com, linux-...@vger.kernel.org, linux-...@vger.kernel.org, vi...@zeniv.linux.org.uk, t...@kernel.org, da...@fromorbit.com, linux...@vger.kernel.org, Linus Torvalds
Dmitry, can you assign VM resources for a git tree for this bug? This bug wants to fight
against https://github.com/google/syzkaller/blob/master/docs/syzbot.md#no-custom-patches ...

Dmitry Vyukov

Jun 7, 2018, 2:46:54 PM
to Tetsuo Handa, Jens Axboe, Jan Kara, syzbot, syzkaller-bugs, linux-fsdevel, LKML, Al Viro, Tejun Heo, Dave Chinner, linux...@vger.kernel.org, Linus Torvalds
On Tue, Jun 5, 2018 at 3:45 PM, Tetsuo Handa
<penguin...@i-love.sakura.ne.jp> wrote:
> Dmitry, can you assign VM resources for a git tree for this bug? This bug wants to fight
> against https://github.com/google/syzkaller/blob/master/docs/syzbot.md#no-custom-patches ...

Hi Tetsuo,

Most of the reasons for not doing it still stand. A syzkaller instance
will produce not just this bug; it will produce hundreds of different
bugs. Then the question is: what do we do with those bugs? Report them all
to the mailing lists?
I think the solution here is just to run a syzkaller instance locally.
It's just a program; anybody can run it on any kernel with any custom
patches. Moreover, for a local instance it's also possible to limit the set
of tested syscalls to increase the probability of hitting this bug and at
the same time filter out most of the other bugs.

Do we have any idea about the guilty subsystem? You mentioned
bdi_unregister, why? What would be the set of syscalls to concentrate
on?
I will do a custom run when I get around to it, if nobody else beats me to it.

Tetsuo Handa

Jun 7, 2018, 10:31:58 PM
to Dmitry Vyukov, Jens Axboe, Jan Kara, syzbot, syzkaller-bugs, linux-fsdevel, LKML, Al Viro, Tejun Heo, Dave Chinner, linux...@vger.kernel.org, Linus Torvalds
Dmitry Vyukov wrote:
> On Tue, Jun 5, 2018 at 3:45 PM, Tetsuo Handa
> <penguin...@i-love.sakura.ne.jp> wrote:
> > Dmitry, can you assign VM resources for a git tree for this bug? This bug wants to fight
> > against https://github.com/google/syzkaller/blob/master/docs/syzbot.md#no-custom-patches ...
>
> Hi Tetsuo,
>
> Most of the reasons for not doing it still stand. A syzkaller instance
> will produce not just this bug, it will produce hundreds of different
> bugs. Then the question is: what to do with these bugs? Report all to
> mailing lists?

Is it possible to add the linux-next.git tree as a target for fuzzing? If yes,
we can try debug patches easily, in addition to finding bugs earlier than we do now.

> I think the solution here is just to run syzkaller instance locally.
> It's just a program anybody can run it on any kernel with any custom
> patches. Moreover for local instance it's also possible to limit set
> of tested syscalls to increase probability of hitting this bug and at
> the same time filter out most of other bugs.

If this bug is reproducible with VM resources an individual developer can afford...

Since my Linux development environment is VMware guests on a Windows PC, I can't
run a VM instance that needs KVM acceleration. Also, due to security policy, I
can't utilize external VM resources available on the Internet, nor can I use the
ssh and git protocols. Speaking of this bug, even with a lot of VM instances,
syzbot can reproduce it only once or twice per day. Thus, the question for me
boils down to whether I can reproduce this bug using one VMware guest instance
with 4GB of memory. Effectively, I don't have access to an environment for
running a syzkaller instance...

>
> Do we have any idea about the guilty subsystem? You mentioned
> bdi_unregister, why? What would be the set of syscalls to concentrate
> on?
> I will do a custom run when I get around to it, if nobody else beats me to it.

Because bdi_unregister() does "bdi->dev = NULL;", and that is the NULL
pointer that wb_workfn() is dereferencing.

Dmitry Vyukov

Jun 8, 2018, 10:45:51 AM
to Tetsuo Handa, Jens Axboe, Jan Kara, syzbot, syzkaller-bugs, linux-fsdevel, LKML, Al Viro, Tejun Heo, Dave Chinner, linux...@vger.kernel.org, Linus Torvalds
On Fri, Jun 8, 2018 at 4:31 AM, Tetsuo Handa
<penguin...@i-love.sakura.ne.jp> wrote:
> Dmitry Vyukov wrote:
>> On Tue, Jun 5, 2018 at 3:45 PM, Tetsuo Handa
>> <penguin...@i-love.sakura.ne.jp> wrote:
>> > Dmitry, can you assign VM resources for a git tree for this bug? This bug wants to fight
>> > against https://github.com/google/syzkaller/blob/master/docs/syzbot.md#no-custom-patches ...
>>
>> Hi Tetsuo,
>>
>> Most of the reasons for not doing it still stand. A syzkaller instance
>> will produce not just this bug, it will produce hundreds of different
>> bugs. Then the question is: what to do with these bugs? Report all to
>> mailing lists?
>
> Is it possible to add linux-next.git tree as a target for fuzzing? If yes,
> we can try debug patches easily, in addition to find bugs earlier than now.

syzbot tested linux-next and mmotm initially, but they were removed at
the request of kernel developers. See:
https://groups.google.com/d/msg/syzkaller/0H0LHW_ayR8/dsK5qGB_AQAJ
and:
https://groups.google.com/d/msg/syzkaller-bugs/FeAgni6Atlk/U0JGoR0AAwAJ
Indeed, linux-next produces around 50 assorted one-off unexplainable
bug reports.


>> I think the solution here is just to run a syzkaller instance locally.
>> It's just a program; anybody can run it on any kernel with any custom
>> patches. Moreover, for a local instance it's also possible to limit the set
>> of tested syscalls to increase the probability of hitting this bug and at
>> the same time filter out most of the other bugs.
>
> If this bug is reproducible with VM resources individual developer can afford...
>
> Since my Linux development environment is VMware guests on a Windows PC, I can't
> run a VM instance which needs KVM acceleration. Also, due to security policy, I can't
> utilize external VM resources available on the Internet, and I can't use the ssh
> and git protocols. Speaking of this bug, even with a lot of VM instances, syzbot can
> reproduce this bug only once or twice per day. Thus, the question for me boils
> down to whether I can reproduce this bug using one VMware guest instance with 4GB
> of memory. Effectively, I don't have access to environments for running a syzkaller
> instance...

Well, I don't know what to say, it does require some resources.

>> Do we have any idea about the guilty subsystem? You mentioned
>> bdi_unregister, why? What would be the set of syscalls to concentrate
>> on?
>> I will do a custom run when I get around to it, if nobody else beats me to it.
>
> Because bdi_unregister() does "bdi->dev = NULL;", which is where wb_workfn()
> hits the NULL pointer dereference.

Right, wb_workfn is not a generic function; it's an fs-specific function.

Trying to reproduce this locally now.

Dmitry Vyukov

unread,
Jun 8, 2018, 11:16:26 AM6/8/18
to Tetsuo Handa, Jens Axboe, Jan Kara, syzbot, syzkaller-bugs, linux-fsdevel, LKML, Al Viro, Tejun Heo, Dave Chinner, linux...@vger.kernel.org, Linus Torvalds
No luck so far.

Trying to look from a different angle: is it possible that bdi->dev is
not set yet, rather than already reset?

Dmitry Vyukov

unread,
Jun 8, 2018, 12:54:00 PM6/8/18
to Tetsuo Handa, Jens Axboe, Jan Kara, syzbot, syzkaller-bugs, linux-fsdevel, LKML, Al Viro, Tejun Heo, Dave Chinner, linux...@vger.kernel.org, Linus Torvalds
I was able to reproduce this once locally by running the syz-crush utility
replaying one of the crash logs. Now running with Tetsuo's patch.

I can say we are hunting a very subtle race condition with a short
inconsistency window, perhaps a few instructions.

Dmitry Vyukov

unread,
Jun 8, 2018, 1:15:00 PM6/8/18
to Tetsuo Handa, Jens Axboe, Jan Kara, syzbot, syzkaller-bugs, linux-fsdevel, LKML, Al Viro, Tejun Heo, Dave Chinner, linux...@vger.kernel.org, Linus Torvalds
Here we go:

[ 2853.033175] WARNING: wb_workfn: device is NULL
[ 2853.034709] wb->state=2
[ 2853.035486] (wb == &wb->bdi->wb)=0
[ 2853.036489] list_empty(&wb->work_list)=1
[ 2853.037603] list_empty(&wb->bdi->bdi_list)=0
[ 2853.038843] wb->bdi->wb.state=0
[ 2853.039819] (wb->congested->__bdi == wb->bdi)=1
[ 2853.041062] list_empty(&wb->congested->__bdi->bdi_list)=0
[ 2853.042609] wb->congested->__bdi->wb.state=0
[ 2853.043793] kasan: CONFIG_KASAN_INLINE enabled
[ 2853.045315] kasan: GPF could be caused by NULL-ptr deref or user
memory access
[ 2853.047376] general protection fault: 0000 [#1] SMP KASAN
[ 2853.048980] CPU: 1 PID: 13971 Comm: kworker/u12:8 Not tainted 4.17.0+ #21
[ 2853.050762] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.10.2-1 04/01/2014
[ 2853.053034] Workqueue: writeback wb_workfn
[ 2853.054193] RIP: 0010:wb_workfn+0x187/0xab0
[ 2853.055360] Code: 85 70 fd ff ff 48 83 e8 10 48 89 85 60 fd ff ff
e8 5e 38 ab ff 48 8d 7b 50 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48
c1 ea 03 <80> 3c 02 00 0f 85 05 08 00 00 4c 8b 63 50 4d 85 e4 0f 84 b5
05 00
[ 2853.060692] RSP: 0018:ffff8800600ef480 EFLAGS: 00010206
[ 2853.062210] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 2853.064215] RDX: 000000000000000a RSI: ffffffff81cf0312 RDI: 0000000000000050
[ 2853.066198] RBP: ffff8800600ef750 R08: ffff880061e30400 R09: ffffed000d8ccfc0
[ 2853.068037] R10: ffffed000d8ccfc0 R11: ffff88006c667e07 R12: 1ffff1000c01dead
[ 2853.069970] R13: ffff8800600ef728 R14: ffff8800676bd180 R15: dffffc0000000000
[ 2853.071932] FS: 0000000000000000(0000) GS:ffff88006c640000(0000)
knlGS:0000000000000000
[ 2853.074080] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2853.075699] CR2: 00007ffc8478c000 CR3: 0000000008c6a006 CR4: 00000000001626e0
[ 2853.077633] Call Trace:
[ 2853.078341] ? inode_wait_for_writeback+0x40/0x40
[ 2853.079642] ? graph_lock+0x170/0x170
[ 2853.080710] ? lock_downgrade+0x8e0/0x8e0
[ 2853.081889] ? find_held_lock+0x36/0x1c0
[ 2853.083047] ? graph_lock+0x170/0x170
[ 2853.084105] ? lock_acquire+0x1dc/0x520
[ 2853.085216] ? process_one_work+0xb8c/0x1b70
[ 2853.086425] ? kasan_check_read+0x11/0x20
[ 2853.087608] ? __lock_is_held+0xb5/0x140
[ 2853.088720] process_one_work+0xc64/0x1b70
[ 2853.089875] ? finish_task_switch+0x182/0x840
[ 2853.091085] ? pwq_dec_nr_in_flight+0x490/0x490
[ 2853.092358] ? __schedule+0x809/0x1e30
[ 2853.093475] ? retint_kernel+0x10/0x10
[ 2853.094561] ? retint_kernel+0x10/0x10
[ 2853.095610] ? graph_lock+0x170/0x170
[ 2853.096623] ? trace_hardirqs_on_caller+0x421/0x5c0
[ 2853.097981] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 2853.099298] ? find_held_lock+0x36/0x1c0
[ 2853.100394] ? lock_acquire+0x1dc/0x520
[ 2853.101472] ? worker_thread+0x3d4/0x13a0
[ 2853.102558] ? __sanitizer_cov_trace_const_cmp8+0x18/0x20
[ 2853.104060] ? move_linked_works+0x2f6/0x470
[ 2853.105204] ? trace_event_raw_event_workqueue_execute_start+0x290/0x290
[ 2853.107013] ? do_raw_spin_trylock+0x1b0/0x1b0
[ 2853.108243] worker_thread+0x9e5/0x13a0
[ 2853.109304] ? process_one_work+0x1b70/0x1b70
[ 2853.110488] ? graph_lock+0x170/0x170
[ 2853.111514] ? find_held_lock+0x36/0x1c0
[ 2853.112610] ? find_held_lock+0x36/0x1c0
[ 2853.113674] ? __schedule+0x1e30/0x1e30
[ 2853.114714] ? do_raw_spin_unlock+0x1f9/0x2e0
[ 2853.115885] ? do_raw_spin_trylock+0x1b0/0x1b0
[ 2853.117060] ? __sanitizer_cov_trace_const_cmp8+0x18/0x20
[ 2853.118492] ? __kthread_parkme+0x111/0x1d0
[ 2853.119616] ? parse_args.cold.15+0x1b3/0x1b3
[ 2853.120804] ? trace_hardirqs_on_caller+0x421/0x5c0
[ 2853.122071] ? trace_hardirqs_on+0xd/0x10
[ 2853.123117] kthread+0x345/0x410
[ 2853.123999] ? process_one_work+0x1b70/0x1b70
[ 2853.125174] ? kthread_bind+0x40/0x40
[ 2853.126201] ret_from_fork+0x3a/0x50
[ 2853.127199] Modules linked in:
[ 2853.128022] Dumping ftrace buffer:
[ 2853.128901] (ftrace buffer empty)
[ 2853.129986] ---[ end trace 3ba28e076cb32fda ]---
[ 2853.131269] RIP: 0010:wb_workfn+0x187/0xab0
[ 2853.132441] Code: 85 70 fd ff ff 48 83 e8 10 48 89 85 60 fd ff ff
e8 5e 38 ab ff 48 8d 7b 50 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48
c1 ea 03 <80> 3c 02 00 0f 85 05 08 00 00 4c 8b 63 50 4d 85 e4 0f 84 b5
05 00
[ 2853.137449] RSP: 0018:ffff8800600ef480 EFLAGS: 00010206
[ 2853.138618] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 2853.140176] RDX: 000000000000000a RSI: ffffffff81cf0312 RDI: 0000000000000050
[ 2853.141722] RBP: ffff8800600ef750 R08: ffff880061e30400 R09: ffffed000d8ccfc0
[ 2853.143300] R10: ffffed000d8ccfc0 R11: ffff88006c667e07 R12: 1ffff1000c01dead
[ 2853.144841] R13: ffff8800600ef728 R14: ffff8800676bd180 R15: dffffc0000000000
[ 2853.146391] FS: 0000000000000000(0000) GS:ffff88006c640000(0000)
knlGS:0000000000000000
[ 2853.148141] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2853.149406] CR2: 00007ffc8478c000 CR3: 0000000008c6a006 CR4: 00000000001626e0
[ 2853.150968] Kernel panic - not syncing: Fatal exception
[ 2853.152419] Dumping ftrace buffer:
[ 2853.153121] (ftrace buffer empty)
[ 2853.153786] Kernel Offset: disabled
[ 2853.154442] Rebooting in 86400 seconds..

Tetsuo Handa

unread,
Jun 9, 2018, 1:31:32 AM6/9/18
to Dmitry Vyukov, Jens Axboe, Jan Kara, syzbot, syzkaller-bugs, linux-fsdevel, LKML, Al Viro, Tejun Heo, Dave Chinner, linux...@vger.kernel.org, Linus Torvalds
Dmitry Vyukov wrote:
> Here we go:

Great. Thank you.

>
> [ 2853.033175] WARNING: wb_workfn: device is NULL
> [ 2853.034709] wb->state=2
>

It is surprising that wb->state == WB_shutting_down.

WB_shutting_down is set only by wb_shutdown() and is always cleared
before leaving wb_shutdown(). This means that someone was calling
wb_shutdown() on this wb object. And bdi->dev == NULL means that
bdi_unregister() already did bdi->dev = NULL while someone was still
inside wb_shutdown().

Since we call wb_shutdown() from bdi_unregister() for each wb object
on this bdi object, this should not happen. But since "INFO: task hung
in wb_shutdown (2)" found that it is possible that wb_shutdown() is
concurrently called on the same wb object, there might be some
complicated concurrency.

Well, is it really true that "we call wb_shutdown() from bdi_unregister()
for each wb object on this bdi object"? It seems it is not always true...

While cgwb_bdi_unregister() from bdi_unregister() calls wb_shutdown() on
each wb object reachable from bdi->wb_list, wb_shutdown() first calls
list_del_rcu(&wb->bdi_node) (undoing the
list_add_tail_rcu(&wb->bdi_node, &bdi->wb_list) from cgwb_create()) and
only then starts waiting for that wb object by calling
mod_delayed_work()/flush_delayed_work(), and then clears WB_shutting_down.

Thus, it is possible that wb_shutdown() called from cgwb_bdi_unregister()
fails to find a wb object which already passed list_del_rcu() in another
wb_shutdown(), and cgwb_bdi_unregister() can return without waiting for
whoever is still waiting inside wb_shutdown(). This allows
bdi->dev = NULL to happen before a wb object somebody is waiting on inside
wb_shutdown() completes wb_workfn() - hence the NULL pointer dereference...

Tetsuo Handa

unread,
Jun 9, 2018, 10:00:52 AM6/9/18
to Dmitry Vyukov, Jens Axboe, Jan Kara, syzbot, syzkaller-bugs, linux-fsdevel, LKML, Al Viro, Tejun Heo, Dave Chinner, linux...@vger.kernel.org, Linus Torvalds
From 014c4149f2e24cd26b278b32d5dfda056eecf093 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin...@I-love.SAKURA.ne.jp>
Date: Sat, 9 Jun 2018 22:47:52 +0900
Subject: [PATCH] bdi: Fix another oops in wb_workfn()

syzbot is reporting a NULL pointer dereference at wb_workfn() [1] due to
wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
WB_shutting_down after wb->bdi->dev became NULL. This indicates that
bdi_unregister() failed to call wb_shutdown() on one of the wb objects.

Since cgwb_bdi_unregister() from bdi_unregister() cannot call wb_shutdown()
on wb objects which have already passed list_del_rcu() in wb_shutdown(),
cgwb_bdi_unregister() from bdi_unregister() can return and set wb->bdi->dev
to NULL before such wb objects enter the final round of wb_workfn() via
mod_delayed_work()/flush_delayed_work().

Since WB_registered is already cleared by wb_shutdown(), only wb_shutdown()
can schedule the final round of wb_workfn(). Since concurrent calls to
wb_shutdown() on the same wb object are safe because of the WB_shutting_down
state, I think that wb_shutdown() can safely keep a wb object on
bdi->wb_list until that wb object leaves the final round of wb_workfn().
Thus, make wb_shutdown() call list_del_rcu() after flush_delayed_work().
Reported-by: syzbot <syzbot+4a7438...@syzkaller.appspotmail.com>
Cc: Dmitry Vyukov <dvy...@google.com>
Cc: Tejun Heo <t...@kernel.org>
Cc: Jan Kara <ja...@suse.cz>
Cc: Jens Axboe <ax...@fb.com>
---
mm/backing-dev.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 347cc83..bef4b25 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -370,7 +370,6 @@ static void wb_shutdown(struct bdi_writeback *wb)
set_bit(WB_shutting_down, &wb->state);
spin_unlock_bh(&wb->work_lock);

- cgwb_remove_from_bdi_list(wb);
/*
* Drain work list and shutdown the delayed_work. !WB_registered
* tells wb_workfn() that @wb is dying and its work_list needs to
@@ -379,6 +378,7 @@ static void wb_shutdown(struct bdi_writeback *wb)
mod_delayed_work(bdi_wq, &wb->dwork, 0);
flush_delayed_work(&wb->dwork);
WARN_ON(!list_empty(&wb->work_list));
+ cgwb_remove_from_bdi_list(wb);
/*
* Make sure bit gets cleared after shutdown is finished. Matches with
* the barrier provided by test_and_clear_bit() above.
--
1.8.3.1


Jan Kara

unread,
Jun 11, 2018, 5:12:51 AM6/11/18
to Tetsuo Handa, Dmitry Vyukov, Jens Axboe, Jan Kara, syzbot, syzkaller-bugs, linux-fsdevel, LKML, Al Viro, Tejun Heo, Dave Chinner, linux...@vger.kernel.org, Linus Torvalds
On Sat 09-06-18 23:00:05, Tetsuo Handa wrote:
> From 014c4149f2e24cd26b278b32d5dfda056eecf093 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin...@I-love.SAKURA.ne.jp>
> Date: Sat, 9 Jun 2018 22:47:52 +0900
> Subject: [PATCH] bdi: Fix another oops in wb_workfn()
>
> syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
> wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
> WB_shutting_down after wb->bdi->dev became NULL. This indicates that
> bdi_unregister() failed to call wb_shutdown() on one of the wb objects.
>
> Since cgwb_bdi_unregister() from bdi_unregister() cannot call wb_shutdown()
> on wb objects which have already passed list_del_rcu() in wb_shutdown(),
> cgwb_bdi_unregister() from bdi_unregister() can return and set wb->bdi->dev
> to NULL before such wb objects enter final round of wb_workfn() via
> mod_delayed_work()/flush_delayed_work().

Thanks a lot for debugging the issue and also thanks a lot to Dmitry for
taking time to reproduce the race by hand with the debug patch! I really
appreciate it!

> Since WB_registered is already cleared by wb_shutdown(), only wb_shutdown()
> can schedule for final round of wb_workfn(). Since concurrent calls to
> wb_shutdown() on the same wb object is safe because of WB_shutting_down
> state, I think that wb_shutdown() can safely keep a wb object in the
> bdi->wb_list until that wb object leaves final round of wb_workfn().
> Thus, make wb_shutdown() call list_del_rcu() after flush_delayed_work().

However this is wrong, and so is the patch. The problem is in
cgwb_bdi_unregister(), which does cgwb_kill() and thus drops bdi's
reference to the wb structures before going through the list of wbs again
and calling wb_shutdown() on each of them. The writeback structures we are
accessing at this point can in principle already be freed, like:

CPU1 CPU2
cgwb_bdi_unregister()
cgwb_kill(*slot);

cgwb_release()
queue_work(cgwb_release_wq, &wb->release_work);
cgwb_release_workfn()
wb = list_first_entry(&bdi->wb_list, ...)
spin_unlock_irq(&cgwb_lock);
wb_shutdown(wb);
...
kfree_rcu(wb, rcu);
wb_shutdown(wb); -> oops use-after-free

I'm not 100% sure how to fix this. wb structures can be at various phases of
shutdown (or there may be other external references still existing) when we
enter cgwb_bdi_unregister(), so I think adding a way for cgwb_bdi_unregister()
to wait for the standard wb shutdown path to finish is the most robust way.
What do you think about the attached patch, Tejun? So far only compile tested...

Possible problem with it is that now cgwb_bdi_unregister() will wait for
all wb references to be dropped so it adds some implicit dependencies to
bdi shutdown path.

Honza
0001-bdi-Fix-another-oops-in-wb_workfn.patch

Tejun Heo

unread,
Jun 11, 2018, 12:01:34 PM6/11/18
to Jan Kara, Tetsuo Handa, Dmitry Vyukov, Jens Axboe, syzbot, syzkaller-bugs, linux-fsdevel, LKML, Al Viro, Dave Chinner, linux...@vger.kernel.org, Linus Torvalds
Hello,
Would something like the following work or am I missing the point
entirely?

Thanks.

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 347cc83..359cacd 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -715,14 +715,19 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
WARN_ON(test_bit(WB_registered, &bdi->wb.state));

spin_lock_irq(&cgwb_lock);
- radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
- cgwb_kill(*slot);
+ radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0) {
+ struct bdi_writeback *wb = *slot;
+
+ wb_get(wb);
+ cgwb_kill(wb);
+ }

while (!list_empty(&bdi->wb_list)) {
wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
bdi_node);
spin_unlock_irq(&cgwb_lock);
wb_shutdown(wb);
+ wb_put(wb);
spin_lock_irq(&cgwb_lock);
}
spin_unlock_irq(&cgwb_lock);


--
tejun

Jan Kara

unread,
Jun 11, 2018, 12:29:24 PM6/11/18
to Tejun Heo, Jan Kara, Tetsuo Handa, Dmitry Vyukov, Jens Axboe, syzbot, syzkaller-bugs, linux-fsdevel, LKML, Al Viro, Dave Chinner, linux...@vger.kernel.org, Linus Torvalds
I was pondering the same solution for a while but I think it won't work.
The problem is that e.g. wb_memcg_offline() could have already removed the
wb from the radix tree, but it is still pending in bdi->wb_list
(wb_shutdown() has not run yet), and so we'd drop a reference we didn't get.

Honza

Tejun Heo

unread,
Jun 11, 2018, 1:20:56 PM6/11/18
to Jan Kara, Tetsuo Handa, Dmitry Vyukov, Jens Axboe, syzbot, syzkaller-bugs, linux-fsdevel, LKML, Al Viro, Dave Chinner, linux...@vger.kernel.org, Linus Torvalds
Hello,

On Mon, Jun 11, 2018 at 06:29:20PM +0200, Jan Kara wrote:
> > Would something like the following work or am I missing the point
> > entirely?
>
> I was pondering the same solution for a while but I think it won't work.
> The problem is that e.g. wb_memcg_offline() could have already removed
> wb from the radix tree but it is still pending in bdi->wb_list
> (wb_shutdown() has not run yet) and so we'd drop reference we didn't get.

Yeah, right, so the root cause is that we're walking the wb_list while
holding the lock and expecting the object to stay there even after the lock
is released. Hmm... we can use a mutex to synchronize the two
destruction paths. It's not like they're hot paths anyway.

Thanks.

--
tejun

Jan Kara

unread,
Jun 12, 2018, 11:57:56 AM6/12/18
to Tejun Heo, Jan Kara, Tetsuo Handa, Dmitry Vyukov, Jens Axboe, syzbot, syzkaller-bugs, linux-fsdevel, LKML, Al Viro, Dave Chinner, linux...@vger.kernel.org, Linus Torvalds
Hmm, do you mean having a per-bdi or even a global mutex that would
protect the whole of wb_shutdown()? Yes, that should work, and we could get
rid of the WB_shutting_down bit as well with that. It just seems a bit
strange to introduce a mutex only to synchronize these two shutdown paths -
usually locks protect data structures, and in this case we have cgwb_lock
for that, so it looks like duplication at first glance.

Honza

Tetsuo Handa

unread,
Jun 13, 2018, 6:44:35 AM6/13/18
to Jan Kara, Tejun Heo, Dmitry Vyukov, Jens Axboe, syzbot, syzkaller-bugs, linux-fsdevel, LKML, Al Viro, Dave Chinner, linux...@vger.kernel.org, Linus Torvalds
Can't we utilize an RCU grace period (as shown below)?



If wb_shutdown(wb) by cgwb_release_workfn() was faster than wb_shutdown(wb) by cgwb_bdi_unregister():

cgwb_bdi_unregister(bdi) cgwb_release_workfn(work)

wb = container_of(work, struct bdi_writeback, release_work);
spin_lock_irq(&cgwb_lock);
wb = list_first_entry(&bdi->wb_list, struct bdi_writeback, bdi_node); /* Same wb here */
rcu_read_lock(); /* Prevent kfree_rcu() from invoking kfree() */
spin_unlock_irq(&cgwb_lock);
wb_shutdown(wb);
spin_lock_bh(&wb->work_lock);
!test_and_clear_bit(WB_registered, &wb->state) is "false".
set_bit(WB_shutting_down, &wb->state);
spin_unlock_bh(&wb->work_lock);
mod_delayed_work(bdi_wq, &wb->dwork, 0);
flush_delayed_work(&wb->dwork);
cgwb_remove_from_bdi_list(wb);
spin_lock_irq(&cgwb_lock);
list_del_rcu(&wb->bdi_node);
spin_unlock_irq(&cgwb_lock);
wb_exit(wb);
kfree_rcu(wb, rcu); /* Won't call kfree() because of rcu_read_lock() */
wb_shutdown(wb);
spin_lock_bh(&wb->work_lock); /* Safe to access because kfree() cannot be called */
!test_and_clear_bit(WB_registered, &wb->state) is "true".
spin_unlock_bh(&wb->work_lock);
rcu_read_unlock();
kfree(wb);
schedule_timeout_uninterruptible(HZ / 10); /* Try to wait in case list_del_rcu() is not yet called */
spin_lock_irq(&cgwb_lock);
wb = list_first_entry(&bdi->wb_list, struct bdi_writeback, bdi_node); /* Different wb if list_del_rcu() was already called, same wb otherwise */
rcu_read_lock(); /* Prevent kfree_rcu() from invoking kfree() if still same wb here */
spin_unlock_irq(&cgwb_lock);

If wb_shutdown(wb) by cgwb_bdi_unregister() was faster than wb_shutdown(wb) by cgwb_release_workfn():

cgwb_bdi_unregister(bdi) cgwb_release_workfn(work)

wb = container_of(work, struct bdi_writeback, release_work);
spin_lock_irq(&cgwb_lock);
wb = list_first_entry(&bdi->wb_list, struct bdi_writeback, bdi_node); /* Same wb here */
rcu_read_lock();
spin_unlock_irq(&cgwb_lock);
wb_shutdown(wb);
spin_lock_bh(&wb->work_lock);
!test_and_clear_bit(WB_registered, &wb->state) is "false".
set_bit(WB_shutting_down, &wb->state);
spin_unlock_bh(&wb->work_lock);
rcu_read_unlock();
mod_delayed_work(bdi_wq, &wb->dwork, 0);
flush_delayed_work(&wb->dwork);
cgwb_remove_from_bdi_list(wb);
spin_lock_irq(&cgwb_lock);
list_del_rcu(&wb->bdi_node);
spin_unlock_irq(&cgwb_lock);
spin_lock_irq(&cgwb_lock);
wb = list_first_entry(&bdi->wb_list, struct bdi_writeback, bdi_node); /* Different wb here */
rcu_read_lock();
spin_unlock_irq(&cgwb_lock);
wb_shutdown(wb);
spin_lock_bh(&wb->work_lock); /* Safe to access because kfree() cannot be called */
!test_and_clear_bit(WB_registered, &wb->state) is "true".
spin_unlock_bh(&wb->work_lock);
wb_exit(wb);
kfree_rcu(wb, rcu);



mm/backing-dev.c | 24 +++++++++++++++++++-----
1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 347cc83..6886cea 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -353,13 +353,24 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
/*
* Remove bdi from the global list and shutdown any threads we have running
*/
-static void wb_shutdown(struct bdi_writeback *wb)
+static void wb_shutdown(struct bdi_writeback *wb, const bool in_rcu)
{
/* Make sure nobody queues further work */
spin_lock_bh(&wb->work_lock);
if (!test_and_clear_bit(WB_registered, &wb->state)) {
spin_unlock_bh(&wb->work_lock);
/*
+ * Try to wait for cgwb_remove_from_bdi_list() without
+ * using wait_on_bit(), for kfree_rcu(wb, rcu) from
+ * cgwb_release_workfn() might invoke kfree() before we
+ * recheck WB_shutting_down bit.
+ */
+ if (in_rcu) {
+ rcu_read_unlock();
+ schedule_timeout_uninterruptible(HZ / 10);
+ return;
+ }
+ /*
* Wait for wb shutdown to finish if someone else is just
* running wb_shutdown(). Otherwise we could proceed to wb /
* bdi destruction before wb_shutdown() is finished.
@@ -369,8 +380,9 @@ static void wb_shutdown(struct bdi_writeback *wb)
}
set_bit(WB_shutting_down, &wb->state);
spin_unlock_bh(&wb->work_lock);
+ if (in_rcu)
+ rcu_read_unlock();

- cgwb_remove_from_bdi_list(wb);
/*
* Drain work list and shutdown the delayed_work. !WB_registered
* tells wb_workfn() that @wb is dying and its work_list needs to
@@ -379,6 +391,7 @@ static void wb_shutdown(struct bdi_writeback *wb)
mod_delayed_work(bdi_wq, &wb->dwork, 0);
flush_delayed_work(&wb->dwork);
WARN_ON(!list_empty(&wb->work_list));
+ cgwb_remove_from_bdi_list(wb);
/*
* Make sure bit gets cleared after shutdown is finished. Matches with
* the barrier provided by test_and_clear_bit() above.
@@ -508,7 +521,7 @@ static void cgwb_release_workfn(struct work_struct *work)
struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
release_work);

- wb_shutdown(wb);
+ wb_shutdown(wb, false);

css_put(wb->memcg_css);
css_put(wb->blkcg_css);
@@ -721,8 +734,9 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
while (!list_empty(&bdi->wb_list)) {
wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
bdi_node);
+ rcu_read_lock();
spin_unlock_irq(&cgwb_lock);
- wb_shutdown(wb);
+ wb_shutdown(wb, true);
spin_lock_irq(&cgwb_lock);
}
spin_unlock_irq(&cgwb_lock);
@@ -944,7 +958,7 @@ void bdi_unregister(struct backing_dev_info *bdi)
{
/* make sure nobody finds us on the bdi_list anymore */
bdi_remove_from_list(bdi);
- wb_shutdown(&bdi->wb);
+ wb_shutdown(&bdi->wb, false);
cgwb_bdi_unregister(bdi);

if (bdi->dev) {
--



Or, equivalent one but easier to read:



mm/backing-dev.c | 42 +++++++++++++++++++++++++++++++-----------
1 file changed, 31 insertions(+), 11 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 347cc83..77088a3 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -350,15 +350,25 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,

static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb);

-/*
- * Remove bdi from the global list and shutdown any threads we have running
- */
-static void wb_shutdown(struct bdi_writeback *wb)
+static bool wb_start_shutdown(struct bdi_writeback *wb)
{
/* Make sure nobody queues further work */
spin_lock_bh(&wb->work_lock);
if (!test_and_clear_bit(WB_registered, &wb->state)) {
spin_unlock_bh(&wb->work_lock);
+ return false;
+ }
+ set_bit(WB_shutting_down, &wb->state);
+ spin_unlock_bh(&wb->work_lock);
+ return true;
+}
+
+/*
+ * Remove bdi from the global list and shutdown any threads we have running
+ */
+static void wb_shutdown(struct bdi_writeback *wb, const bool started)
+{
+ if (!started && !wb_start_shutdown(wb)) {
/*
* Wait for wb shutdown to finish if someone else is just
* running wb_shutdown(). Otherwise we could proceed to wb /
@@ -367,10 +377,6 @@ static void wb_shutdown(struct bdi_writeback *wb)
wait_on_bit(&wb->state, WB_shutting_down, TASK_UNINTERRUPTIBLE);
return;
}
- set_bit(WB_shutting_down, &wb->state);
- spin_unlock_bh(&wb->work_lock);
-
- cgwb_remove_from_bdi_list(wb);
/*
* Drain work list and shutdown the delayed_work. !WB_registered
* tells wb_workfn() that @wb is dying and its work_list needs to
@@ -379,6 +385,7 @@ static void wb_shutdown(struct bdi_writeback *wb)
mod_delayed_work(bdi_wq, &wb->dwork, 0);
flush_delayed_work(&wb->dwork);
WARN_ON(!list_empty(&wb->work_list));
+ cgwb_remove_from_bdi_list(wb);
/*
* Make sure bit gets cleared after shutdown is finished. Matches with
* the barrier provided by test_and_clear_bit() above.
@@ -508,7 +515,7 @@ static void cgwb_release_workfn(struct work_struct *work)
struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
release_work);

- wb_shutdown(wb);
+ wb_shutdown(wb, false);

css_put(wb->memcg_css);
css_put(wb->blkcg_css);
@@ -711,6 +718,7 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
struct radix_tree_iter iter;
void **slot;
struct bdi_writeback *wb;
+ bool started;

WARN_ON(test_bit(WB_registered, &bdi->wb.state));

@@ -721,8 +729,20 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
while (!list_empty(&bdi->wb_list)) {
wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
bdi_node);
+ rcu_read_lock();
spin_unlock_irq(&cgwb_lock);
- wb_shutdown(wb);
+ started = wb_start_shutdown(wb);
+ rcu_read_unlock();
+ if (started)
+ wb_shutdown(wb, true);
+ else
+ /*
+ * Try to wait for cgwb_remove_from_bdi_list() without
+ * using wait_on_bit(), for kfree_rcu(wb, rcu) from
+ * cgwb_release_workfn() might invoke kfree() before we
+ * recheck WB_shutting_down bit.
+ */
+ schedule_timeout_uninterruptible(HZ / 10);
spin_lock_irq(&cgwb_lock);
}
spin_unlock_irq(&cgwb_lock);
@@ -944,7 +964,7 @@ void bdi_unregister(struct backing_dev_info *bdi)
{
/* make sure nobody finds us on the bdi_list anymore */
bdi_remove_from_list(bdi);
- wb_shutdown(&bdi->wb);
+ wb_shutdown(&bdi->wb, false);
cgwb_bdi_unregister(bdi);

if (bdi->dev) {
--



Or, toggling a dedicated flag using test_and_change_bit():



include/linux/backing-dev-defs.h | 1 +
mm/backing-dev.c | 8 +++++++-
2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 0bd432a..93ff83c 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -26,6 +26,7 @@ enum wb_state {
WB_writeback_running, /* Writeback is in progress */
WB_has_dirty_io, /* Dirty inodes on ->b_{dirty|io|more_io} */
WB_start_all, /* nr_pages == 0 (all) work pending */
+ WB_postpone_kfree, /* cgwb_bdi_unregister() will access later */
};

enum wb_congested_state {
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 347cc83..422d7a7 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -516,7 +516,10 @@ static void cgwb_release_workfn(struct work_struct *work)
fprop_local_destroy_percpu(&wb->memcg_completions);
percpu_ref_exit(&wb->refcnt);
wb_exit(wb);
- kfree_rcu(wb, rcu);
+ spin_lock_irq(&cgwb_lock);
+ if (!test_and_change_bit(WB_postpone_kfree, &wb->state))
+ kfree_rcu(wb, rcu);
+ spin_unlock_irq(&cgwb_lock);
}

static void cgwb_release(struct percpu_ref *refcnt)
@@ -721,9 +724,12 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
while (!list_empty(&bdi->wb_list)) {
wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
bdi_node);
+ set_bit(WB_postpone_kfree, &wb->state);
spin_unlock_irq(&cgwb_lock);
wb_shutdown(wb);
spin_lock_irq(&cgwb_lock);
+ if (!test_and_change_bit(WB_postpone_kfree, &wb->state))
+ kfree_rcu(wb, rcu);
}
spin_unlock_irq(&cgwb_lock);
}
--

Tetsuo Handa

unread,
Jun 13, 2018, 7:51:46 AM6/13/18
to Jan Kara, Tejun Heo, Dmitry Vyukov, Jens Axboe, syzbot, syzkaller-bugs, linux-fsdevel, LKML, Al Viro, Dave Chinner, linux...@vger.kernel.org, Linus Torvalds
Forgot to include the change below, but isn't this approach the simplest?

@@ -370,7 +370,6 @@ static void wb_shutdown(struct bdi_writeback *wb)
set_bit(WB_shutting_down, &wb->state);
spin_unlock_bh(&wb->work_lock);

- cgwb_remove_from_bdi_list(wb);
/*
* Drain work list and shutdown the delayed_work. !WB_registered
* tells wb_workfn() that @wb is dying and its work_list needs to
@@ -379,6 +378,7 @@ static void wb_shutdown(struct bdi_writeback *wb)
mod_delayed_work(bdi_wq, &wb->dwork, 0);
flush_delayed_work(&wb->dwork);
WARN_ON(!list_empty(&wb->work_list));
+ cgwb_remove_from_bdi_list(wb);
/*
* Make sure bit gets cleared after shutdown is finished. Matches with
* the barrier provided by test_and_clear_bit() above.
--

Linus Torvalds

unread,
Jun 13, 2018, 10:06:56 AM6/13/18
to Tetsuo Handa, Jan Kara, Tejun Heo, Dmitry Vyukov, Jens Axboe, syzbot+4a7438...@syzkaller.appspotmail.com, syzkall...@googlegroups.com, linux-fsdevel, Linux Kernel Mailing List, Al Viro, Dave Chinner, linux-block
On Wed, Jun 13, 2018 at 3:44 AM Tetsuo Handa
<penguin...@i-love.sakura.ne.jp> wrote:
>
> Can't we utilize RCU grace period (like shown below) ?

_Please_ don't do conditional locking. Particularly not the kind where
then a function drops the lock that was taken in another function -
and only does it based on some parameter.

Yes, we have cases where we do it, but it's seldom a great idea, and
it _really_ makes for subtle code (and subtle bugs).

Linus

Tejun Heo

unread,
Jun 13, 2018, 10:33:19 AM6/13/18
to Jan Kara, Tetsuo Handa, Dmitry Vyukov, Jens Axboe, syzbot, syzkaller-bugs, linux-fsdevel, LKML, Al Viro, Dave Chinner, linux...@vger.kernel.org, Linus Torvalds
Hello, Jan.

On Tue, Jun 12, 2018 at 05:57:54PM +0200, Jan Kara wrote:
> > Yeah, right, so the root cause is that we're walking the wb_list while
> > holding lock and expecting the object to stay there even after lock is
> > released. Hmm... we can use a mutex to synchronize the two
> > destruction paths. It's not like they're hot paths anyway.
>
> Hmm, do you mean like having a per-bdi or even a global mutex that would
> protect whole wb_shutdown()? Yes, that should work and we could get rid of
> WB_shutting_down bit as well with that. Just it seems a bit strange to

Yeap.

> introduce a mutex only to synchronize these two shutdown paths - usually
> locks protect data structures and in this case we have cgwb_lock for
> that so it looks like a duplication from a first look.

Yeah, I feel a bit reluctant too, but I think that's the right thing to
do here. This is an inherently weird case where there are two ways
that an object can go away, with an immediate drain requirement from
one side. It's not a hot path, and the dumber the synchronization the
better, right?

Thanks.

--
tejun

Jan Kara

unread,
Jun 13, 2018, 10:46:10 AM6/13/18
to Tetsuo Handa, Jan Kara, Tejun Heo, Dmitry Vyukov, Jens Axboe, syzbot, syzkaller-bugs, linux-fsdevel, LKML, Al Viro, Dave Chinner, linux...@vger.kernel.org, Linus Torvalds
On Wed 13-06-18 19:43:47, Tetsuo Handa wrote:
> Can't we utilize RCU grace period (like shown below) ?

Honestly, variant 1 looks too ugly to me. However, variant 2 looks
mostly OK. We can also avoid the schedule_timeout_uninterruptible(HZ / 10)
from your patch by careful handling of the bit waitqueues. Also I'd avoid
the additional argument to wb_writeback() and split the function instead. The
patch resulting from your and mine ideas is attached. Thoughts?
0001-bdi-Fix-another-oops-in-wb_workfn.patch

Linus Torvalds

Jun 13, 2018, 10:55:44 AM
to Jan Kara, Tetsuo Handa, Tejun Heo, Dmitry Vyukov, Jens Axboe, syzbot+4a7438...@syzkaller.appspotmail.com, syzkall...@googlegroups.com, linux-fsdevel, Linux Kernel Mailing List, Al Viro, Dave Chinner, linux-block
On Wed, Jun 13, 2018 at 7:46 AM Jan Kara <ja...@suse.cz> wrote:
>
> On Wed 13-06-18 19:43:47, Tetsuo Handa wrote:
> > Can't we utilize RCU grace period (like shown below) ?
>
> Honestly, variant 1 looks too ugly to me. However, variant 2 looks
> mostly OK.

The versions that don't have that conditional locking look fine to me, yes.

> Also I'd avoid the additional argument to wb_writeback() and split the function instead. The
> patch resulting from your and mine ideas is attached. Thoughts?

Is there a reason for this model:

+	if (cgwb_start_shutdown(wb))
+		__wb_shutdown(wb);

when there is just one call site of this? Why not just make the
function void, and make it do that __wb_shutdown() itself in the true
case?

IOW, just make it be

+	cgwb_shutdown(wb);

instead?

That's what "wb_shutdown()" does - it does the "wb_start_shutdown()"
test internally, and does __wb_shutdown() all inside itself, instead
of expecting the caller to do it.

I dunno.

Linus

Tetsuo Handa

Jun 13, 2018, 12:20:56 PM
to Jan Kara, Tejun Heo, Dmitry Vyukov, Jens Axboe, syzbot, syzkaller-bugs, linux-fsdevel, LKML, Al Viro, Dave Chinner, linux...@vger.kernel.org, Linus Torvalds
+static bool cgwb_start_shutdown(struct bdi_writeback *wb)
+	__releases(cgwb_lock)
+{
+	if (!wb_start_shutdown(wb)) {
+		DEFINE_WAIT(wait);
+		wait_queue_head_t *wqh = bit_waitqueue(&wb->state,
+						       WB_shutting_down);
+		bool sleep;
+
+		prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+		sleep = test_bit(WB_shutting_down, &wb->state);
+		spin_unlock_irq(&cgwb_lock);
+		if (sleep)
+			schedule();
+		return false;
+	}
+	spin_unlock_irq(&cgwb_lock);
+	return true;
+}

Since multiple addresses share bit_wait_table[256], isn't it possible that
cgwb_start_shutdown() prematurely returns false due to wake_up_bit() by
hash-conflicting addresses (i.e. not limited to clear_and_wake_up_bit() from
wb_shutdown())? I think that we cannot be sure without confirming that
test_bit(WB_shutting_down, &wb->state) == false after returning from schedule().

Linus Torvalds

Jun 13, 2018, 12:25:15 PM
to Tetsuo Handa, Jan Kara, Tejun Heo, Dmitry Vyukov, Jens Axboe, syzbot+4a7438...@syzkaller.appspotmail.com, syzkall...@googlegroups.com, linux-fsdevel, Linux Kernel Mailing List, Al Viro, Dave Chinner, linux-block
On Wed, Jun 13, 2018 at 9:21 AM Tetsuo Handa
<penguin...@i-love.sakura.ne.jp> wrote:
>
> Since multiple addresses share bit_wait_table[256], isn't it possible that
> cgwb_start_shutdown() prematurely returns false due to wake_up_bit() by
> hash-conflicting addresses (i.e. not limited to clear_and_wake_up_bit() from
> wb_shutdown())? I think that we cannot be sure without confirming that
> test_bit(WB_shutting_down, &wb->state) == false after returning from schedule().

Right.

That's _always_ true, btw. Something else entirely could have woken
you up. TASK_UNINTERRUPTIBLE does not mean "nothing else wakes me", it
just means "_signals_ don't wake me".

So every single sleep always needs to be in a loop. Always.

Linus

Jan Kara

Jun 13, 2018, 12:45:14 PM
to Linus Torvalds, Tetsuo Handa, Jan Kara, Tejun Heo, Dmitry Vyukov, Jens Axboe, syzbot+4a7438...@syzkaller.appspotmail.com, syzkall...@googlegroups.com, linux-fsdevel, Linux Kernel Mailing List, Al Viro, Dave Chinner, linux-block
Agreed and in my patch it actually is in a loop - the one iterating the
list of active writeback structures. If we get a false wakeup, we find the
same structure in the list again and wait again...

Tetsuo Handa

Jun 13, 2018, 5:04:50 PM
to Jan Kara, Linus Torvalds, Tejun Heo, Dmitry Vyukov, Jens Axboe, syzbot+4a7438...@syzkaller.appspotmail.com, syzkall...@googlegroups.com, linux-fsdevel, Linux Kernel Mailing List, Al Viro, Dave Chinner, linux-block
Indeed. I overlooked that wb = list_first_entry() will select the same wb again
if cgwb_remove_from_bdi_list() is not yet called. Well, we could update
"(in which case we also wait for it to finish)" part or move the body of
cgwb_start_shutdown() to cgwb_bdi_unregister() so that it becomes clear
that false wake-up is not a problem in this case.

Jan Kara

Jun 14, 2018, 6:11:08 AM
to Tetsuo Handa, Jan Kara, Linus Torvalds, Tejun Heo, Dmitry Vyukov, Jens Axboe, syzbot+4a7438...@syzkaller.appspotmail.com, syzkall...@googlegroups.com, linux-fsdevel, Linux Kernel Mailing List, Al Viro, Dave Chinner, linux-block
I prefer to keep the wb shutdown in a separate function but I've added some
comments to explain that.

Jan Kara

Jun 15, 2018, 8:06:23 AM
to Tejun Heo, Jan Kara, Tetsuo Handa, Dmitry Vyukov, Jens Axboe, syzbot, syzkaller-bugs, linux-fsdevel, LKML, Al Viro, Dave Chinner, linux...@vger.kernel.org, Linus Torvalds
Yeah, fair enough. Something like the attached patch? It is indeed considerably
simpler than fixing the synchronization using WB_shutting_down. This one even
got some testing using scsi_debug; I want to do more testing next week with
cgroup writeback included.
0001-bdi-Fix-another-oops-in-wb_workfn.patch

Jan Kara

Jun 18, 2018, 8:27:32 AM
to Tejun Heo, Jan Kara, Tetsuo Handa, Dmitry Vyukov, Jens Axboe, syzbot, syzkaller-bugs, linux-fsdevel, LKML, Al Viro, Dave Chinner, linux...@vger.kernel.org, Linus Torvalds
OK, the patch has survived some beating with cgroup writeback running. I'll do
an official posting shortly.