fs: GPF in locked_inode_to_wb_and_lock_list

130 views
Skip to first unread message

Dmitry Vyukov

unread,
Apr 18, 2016, 5:44:33 AM4/18/16
to Alexander Viro, linux-...@vger.kernel.org, LKML, Tejun Heo, syzkaller, Kostya Serebryany, Alexander Potapenko
Hello,

I've hit the following GPF while running syzkaller fuzzer on commit
806fdcce017dc98c4dbf8ed001750a0d7d2bb0af (Apr 14):

general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
Modules linked in:
CPU: 3 PID: 32 Comm: kworker/u10:1 Not tainted 4.6.0-rc3+ #349
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Workqueue: writeback wb_workfn (flush-11:0)
task: ffff88006ccf1840 ti: ffff88006cda8000 task.ti: ffff88006cda8000
RIP: 0010:[<ffffffff818884d2>] [<ffffffff818884d2>]
locked_inode_to_wb_and_lock_list+0xa2/0x750
RSP: 0018:ffff88006cdaf7d0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050
RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286
RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000
R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0
R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40
FS: 0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0
DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Stack:
ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948
ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948
ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e
Call Trace:
[< inline >] inode_to_wb_and_lock_list fs/fs-writeback.c:309
[<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554
[<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600
[<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709
[< inline >] wb_do_writeback fs/fs-writeback.c:1844
[<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884
[<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094
[<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228
[<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
[<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392
Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e
00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42>
80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3
RIP [< inline >] wb_get include/linux/backing-dev-defs.h:212
RIP [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750
fs/fs-writeback.c:281
RSP <ffff88006cdaf7d0>
---[ end trace 986a4d314dcb2694 ]---


The crash happened here:

if (wb != &wb->bdi->wb)
ffffffff818884cb: 48 89 d8 mov %rbx,%rax
ffffffff818884ce: 48 c1 e8 03 shr $0x3,%rax
ffffffff818884d2: 42 80 3c 28 00 cmpb $0x0,(%rax,%r13,1)
ffffffff818884d7: 0f 85 17 06 00 00 jne
ffffffff81888af4 <locked_inode_to_wb_and_lock_list+0x6c4>
ffffffff818884dd: 48 8b 03 mov (%rbx),%rax
ffffffff818884e0: 48 83 c0 50 add $0x50,%rax
ffffffff818884e4: 48 39 c3 cmp %rax,%rbx
ffffffff818884e7: 0f 84 c3 00 00 00 je
ffffffff818885b0 <locked_inode_to_wb_and_lock_list+0x180>

Which means that bdi is NULL (if I get indirections right).

The block device is flush-11:0 (/dev/sr0).

I was able to reproduce it once by re-running programs from the crash log:
https://gist.githubusercontent.com/dvyukov/7c9e6358b661e2d50a774a70b5ee348a/raw/5cd94e32bbcf2a788b5efd0768cfd883c5921070/gistfile1.txt
But it took half an hour, and then another hour long run did not
trigger it. So it is probably triggered by some rare race.

Thank you

Tejun Heo

unread,
Apr 20, 2016, 5:14:58 PM4/20/16
to Dmitry Vyukov, Alexander Viro, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko
Hello, Dmitry.

On Mon, Apr 18, 2016 at 11:44:11AM +0200, Dmitry Vyukov wrote:

> general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
...
Man, that's a beautiful trace w/ decoding of inline functions. When
did we start doing that? Is there a specific config option for this?

> ---[ end trace 986a4d314dcb2694 ]---
> The crash happened here:
>
> if (wb != &wb->bdi->wb)
> ffffffff818884cb: 48 89 d8 mov %rbx,%rax
> ffffffff818884ce: 48 c1 e8 03 shr $0x3,%rax
> ffffffff818884d2: 42 80 3c 28 00 cmpb $0x0,(%rax,%r13,1)

So, it's the above instruction.

> ffffffff818884d7: 0f 85 17 06 00 00 jne
> ffffffff81888af4 <locked_inode_to_wb_and_lock_list+0x6c4>
> ffffffff818884dd: 48 8b 03 mov (%rbx),%rax
> ffffffff818884e0: 48 83 c0 50 add $0x50,%rax
> ffffffff818884e4: 48 39 c3 cmp %rax,%rbx
> ffffffff818884e7: 0f 84 c3 00 00 00 je
> ffffffff818885b0 <locked_inode_to_wb_and_lock_list+0x180>
>
> Which means that bdi is NULL (if I get indirections right).

So, the wb != &wb->bdi->wb comparison would be the cmp at
0xffffffff818884e4 and given that it just compares the address of
&bdi->wb, bdi being NULL wouldn't trigger the fault.

cmpb $0x0,(%rax,%r13,1)
-> *(u8 *)(%rax + %r13) == 0
-> *(u8 *)((%rbx >> 3) + %r13) == 0

Where can that be from? I can't find anything matching even in the
surrounding functions.

Hmmm... The base address %r13 is 0xdffffc0000000000 which isn't a
proper canonical address and in general suspcious. Ooh, it's
KASAN_SHADOW_OFFSET. It looks like something is making KASAN trigger
a fault. Can we please bring in someone who's more familiar with
KASAN?

Thanks.

--
tejun

Dmitry Vyukov

unread,
Apr 21, 2016, 4:25:32 AM4/21/16
to Tejun Heo, Alexander Viro, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko
Thanks! :)
I use this script for symbolization:
https://github.com/google/sanitizers/blob/master/address-sanitizer/tools/kasan_symbolize.py
It invokes addr2line to provide file:line info, adds inline frames,
strips ? frames (are they ever useful?) and strips timestamps.
You just need to pipe a report through it. It assumes that vmlinux is
in the current dir, but you can override it with --linux flag.

It can also insert actual source lines into a report (amount of
context is controlled with flags), for example:


Call Trace:
[< inline >] khugepaged_scan_pmd include/linux/uaccess.h:9
8 {
9 current->pagefault_disabled++;
10 }
[< inline >] khugepaged_scan_mm_slot mm/huge_memory.c:2716
2715 hend);
2716 ret = khugepaged_scan_pmd(mm, vma,
2717
khugepaged_scan.address,
[< inline >] khugepaged_do_scan mm/huge_memory.c:2796
2795 pass_through_head < 2)
2796 progress +=
khugepaged_scan_mm_slot(pages - progress,
2797 &hpage);
[<ffffffff81790b33>] khugepaged+0x993/0x48e0 mm/huge_memory.c:2831
2830 while (!kthread_should_stop()) {
2831 khugepaged_do_scan();
2832 khugepaged_wait_work();
[<ffffffff813c195f>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
1302 add_wait_queue(k->waitq, &wait);
1303 __set_current_state(TASK_INTERRUPTIBLE);
1304 }
[<ffffffff866d1b2f>] ret_from_fork+0x3f/0x70 arch/x86/entry/entry_64.S:405
404 SWAPGS
405 jmp restore_regs_and_iret
406 END(ret_from_fork)

Dmitry Vyukov

unread,
Apr 21, 2016, 4:35:44 AM4/21/16
to Tejun Heo, Alexander Viro, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko
On Wed, Apr 20, 2016 at 11:14 PM, Tejun Heo <t...@kernel.org> wrote:
I am here.
For every memory access to ADDR, KASAN makes a byte load from
KASAN_SHADOW_OFFSET+ADDR>>8 first.
For accesses to kernel memory, the byte is addressable and contains
addressability state of the ADDR.
It also has a positive side effect of catching any accesses to user
memory at the place of occurrence as the address computations produces
a non-valid address.
But the negative side effect that instead of a usual NULL-deref GPF,
you now get what we have here. %r13 contains KASAN_SHADOW_OFFSET, and
%rax contains the address that normal code is going to dereference few
instructions later. So without KASAN the code would trigger NULL deref
on (%rbx is a copy of %rax):

ffffffff818884dd: 48 8b 03 mov (%rbx),%rax

So whatever load "&wb->bdi->wb" produces is a NULL deref. (is it wb
that is NULL?)

Sorry for this mess. It is a known issue in KASAN, but we don't know
how to fix it without slowing down execution and sacrificing other
properties.

Andrey Ryabinin

unread,
Apr 21, 2016, 5:10:29 AM4/21/16
to Dmitry Vyukov, Tejun Heo, Alexander Viro, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin
BTW, it would be nice to have that script in kernel tree, instead of
scripts/decode_stacktrace.sh which
is far less advanced.

Dmitry Vyukov

unread,
Apr 21, 2016, 5:29:38 AM4/21/16
to Andrey Ryabinin, Tejun Heo, Alexander Viro, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Andrey Konovalov
I am all for it!
We have 2 minor, unresolved bugs in the script:
https://github.com/google/sanitizers/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+kasan_symbolize
It would be good to resolve them before submitting it to kernel tree.
But I am not particularly experienced in Python. +Andrey Konovalov
will be able to do it in autumn. If anybody is willing to take it
earlier, you are welcome (please drop a note to
https://github.com/google/sanitizers/issues/674 to coordinate
efforts).

Andrey Ryabinin

unread,
Apr 21, 2016, 5:45:22 AM4/21/16
to Dmitry Vyukov, Tejun Heo, Alexander Viro, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko
2016-04-21 11:35 GMT+03:00 Dmitry Vyukov <dvy...@google.com>:
>
> ffffffff818884dd: 48 8b 03 mov (%rbx),%rax
>
> So whatever load "&wb->bdi->wb" produces is a NULL deref. (is it wb
> that is NULL?)

Yes it's NULL wb, because there is only one load:
mov (%rbx),%rax => rax = wb->bdi
add $0x50,%rax => rax = &bdi->wb

Dmitry Vyukov

unread,
Apr 21, 2016, 6:00:58 AM4/21/16
to Andrey Ryabinin, Tejun Heo, Alexander Viro, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko
I bet that wb becomes NULL on the second iteration of the loop. The
loop loops in case of a race with another thread, so it would also
explain why it is difficult to reproduce.

Tejun, does it make any sense to you?

Tejun Heo

unread,
Apr 21, 2016, 12:14:02 PM4/21/16
to Dmitry Vyukov, Alexander Viro, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko
Hello,

On Thu, Apr 21, 2016 at 10:25:12AM +0200, Dmitry Vyukov wrote:
> I use this script for symbolization:
> https://github.com/google/sanitizers/blob/master/address-sanitizer/tools/kasan_symbolize.py
> It invokes addr2line to provide file:line info, adds inline frames,
> strips ? frames (are they ever useful?) and strips timestamps.

Heh, sometimes they are. They obviously are useful when backlinks are
broken for some reason. Through the remnants of the old call stacks,
they also can sometimes tell the history of the task - where it was
before it got to the current backtrace. It isn't reliable at all but
I can recall several occasions where they were useful in root causing
issues. It *might* make a sense to have an option to not strip them.

And, yeah, I'm all for it being in the kernel tree.

Thanks.

--
tejun

Tejun Heo

unread,
Apr 21, 2016, 1:06:38 PM4/21/16
to Dmitry Vyukov, Andrey Ryabinin, Alexander Viro, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Ilya Dryomov, Jens Axboe, Jan Kara
Hello,

(cc'ing Ilya, Jan and Jens)
Yeah, that makes sense. I think the culprit is 43d1c0eb7e11 ("block:
detach bdev inode from its wb in __blkdev_put()") which allows inode
to wb association to be broken while other operations including
writeback are in progress. I thought it should be okay as the inode
must be clean at that point but that obviously doesn't mean that there
can be no writeback operations in flight.

I hope we could eventually move away from the current model where we
try to swap out an underlying data structure while upper layers may
still be referring to it in the future but for now we can make sure
the writeback operation is finished before detaching wb.

Dmitry, I understand that the bug is difficult to reproduce but can
you please give the following patch a try?

Thanks.

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 20a2c02..209ea33 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1530,12 +1530,7 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
kill_bdev(bdev);

bdev_write_inode(bdev);
- /*
- * Detaching bdev inode from its wb in __destroy_inode()
- * is too late: the queue which embeds its bdi (along with
- * root wb) can be gone as soon as we put_disk() below.
- */
- inode_detach_wb(bdev->bd_inode);
+ inode_detach_blkdev_wb(bdev);
}
if (bdev->bd_contains == bdev) {
if (disk->fops->release)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index d0b5ca5..ec1f530 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -230,6 +230,25 @@ static inline void inode_detach_wb(struct inode *inode)
}

/**
+ * inode_detach_blkdev_wb - disassociate a bd_inode from its wb
+ * @bdev: block_device of interest
+ *
+ * @bdev is being put for the last time. Detaching bdev inode in
+ * __destroy_inode() is too late: the queue which embeds its bdi (along
+ * with root wb) can be gone as soon as the containing disk is put.
+ *
+ * This function dissociates @bdev->bd_inode from its wb. The inode must
+ * be clean and no further operations should be started on it.
+ */
+static inline void inode_detach_blkdev_wb(struct block_device *bdev)
+{
+ if (bdev->bd_inode->i_wb) {
+ flush_delayed_work(&bdev->bd_inode->i_wb->dwork);
+ inode_detach_wb(bdev->bd_inode);
+ }
+}
+
+/**
* wbc_attach_fdatawrite_inode - associate wbc and inode for fdatawrite
* @wbc: writeback_control of interest
* @inode: target inode
@@ -277,6 +296,10 @@ static inline void inode_detach_wb(struct inode *inode)
{
}

+static inline void inode_detach_blkdev_wb(struct block_device *bdev)
+{
+}
+
static inline void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
struct inode *inode)
__releases(&inode->i_lock)

Dmitry Vyukov

unread,
Apr 22, 2016, 2:55:47 PM4/22/16
to Tejun Heo, Andrey Ryabinin, Alexander Viro, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Ilya Dryomov, Jens Axboe, Jan Kara
I've merged it into my tree and will restart the fuzzer and leave it
running for the weekend.
Though, yeah, it is difficult to reproduce...

Dmitry Vyukov

unread,
Jun 6, 2016, 1:46:32 PM6/6/16
to Tejun Heo, Andrey Ryabinin, Alexander Viro, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Ilya Dryomov, Jens Axboe, Jan Kara
Hi Tejun,

I have not seem the crash since then. Please mail the patch.

Tejun Heo

unread,
Jun 17, 2016, 12:04:09 PM6/17/16
to Jens Axboe, Andrey Ryabinin, Alexander Viro, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Ilya Dryomov, Jan Kara, Dmitry Vyukov, kerne...@fb.com
43d1c0eb7e11 ("block: detach bdev inode from its wb in
__blkdev_put()") detached bdev inode from its wb as the bdev inode may
outlive the underlying bdi and thus the wb. This is accomplished by
invoking inode_detach_wb() from __blkdev_put(); however, while the
inode can't be dirtied by the time control reaches there, that doesn't
guarantee that writeback isn't already in progress on the inode. This
can lead to the inode being disassociated from its wb while writeback
operation is in flight causing oopses like the following.

general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
Modules linked in:
CPU: 3 PID: 32 Comm: kworker/u10:1 Not tainted 4.6.0-rc3+ #349
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Workqueue: writeback wb_workfn (flush-11:0)
task: ffff88006ccf1840 ti: ffff88006cda8000 task.ti: ffff88006cda8000
---[ end trace 986a4d314dcb2694 ]---

Fix it by flushing the wb dwork before detaching the inode. Combined
with the fact that the inode can no longer be dirtied, this guarantees
that no writeback operation can be in flight or initiated. As this
involves details on writeback side which doesn't have much to do with
block_dev, encapsulate it in a helper function -
inode_detach_blkdev_wb().

Signed-off-by: Tejun Heo <t...@kernel.org>
Reported-and-tested-by: Dmitry Vyukov <dvy...@google.com>
Link: http://lkml.kernel.org/g/CACT4Y+YAjq8mcfiVxR075did...@mail.gmail.com
Fixes: 43d1c0eb7e11 ("block: detach bdev inode from its wb in __blkdev_put()")
Cc: Ilya Dryomov <idry...@gmail.com>
Cc: sta...@vger.kernel.org # v4.2+
---
fs/block_dev.c | 7 +------
include/linux/writeback.h | 23 +++++++++++++++++++++++
2 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 71ccab1..6a14100 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1612,12 +1612,7 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)

Jan Kara

unread,
Jun 20, 2016, 9:31:50 AM6/20/16
to Tejun Heo, Jens Axboe, Andrey Ryabinin, Alexander Viro, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Ilya Dryomov, Jan Kara, Dmitry Vyukov, kerne...@fb.com, Ted Tso
Hi,

On Fri 17-06-16 12:04:05, Tejun Heo wrote:
> 43d1c0eb7e11 ("block: detach bdev inode from its wb in
> __blkdev_put()") detached bdev inode from its wb as the bdev inode may
> outlive the underlying bdi and thus the wb. This is accomplished by
> invoking inode_detach_wb() from __blkdev_put(); however, while the
> inode can't be dirtied by the time control reaches there, that doesn't
> guarantee that writeback isn't already in progress on the inode. This
> can lead to the inode being disassociated from its wb while writeback
> operation is in flight causing oopses like the following.

<snip>

> Fix it by flushing the wb dwork before detaching the inode. Combined
> with the fact that the inode can no longer be dirtied, this guarantees
> that no writeback operation can be in flight or initiated.

Sorry for the late reply but now when thinking about the patch I don't
think it is quite right. Writeback can happen from other contexts than just
the worker one (e.g. kswapd can do writeback, or in some out-of-memory
situations we may punt to doing writeback directly instead of calling the
worker, or sync(2) calls fdatawrite() for block device inode directly when
iterating through blockdev superblock). So flushing the workqueue IMHO is
not covering 100% of cases.

I wanted to suggest to use inode_wait_for_writeback() which will make sure
writeback is done with the inode. However we effectively already do this
by calling bdev_write_inode() which calls writeback_single_inode() in
WB_SYNC_ALL mode. So by the time that call completes we are sure writeback
code is not looking at the inode. *However* what I think is happening is
that sync_blockdev() writes all the dirty pages, bdev_write_inode() writes
the inode and clears all dirty bits, however the inode still stays in the
b_dirty / b_io list of the wb because it has I_DIRTY_TIME set. Subsequently
once flusher work runs, it finds the inode, looks at it and boom. And the
problem seems to be that write_inode_now(inode, true) does not result in
I_DIRTY_TIME being cleared.

Attached patch should fix this issue - it is compile-tested only. Dmitry,
can you check whether this patch fixes the issue for you as well?

Honza
--
Jan Kara <ja...@suse.com>
SUSE Labs, CR
0001-writeback-Write-dirty-times-for-WB_SYNC_ALL-writebac.patch

Dmitry Vyukov

unread,
Jun 20, 2016, 9:39:06 AM6/20/16
to Jan Kara, Tejun Heo, Jens Axboe, Andrey Ryabinin, Alexander Viro, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Ilya Dryomov, Jan Kara, kerne...@fb.com, Ted Tso
I can't directly test it because crash happened very infrequently.
If Tejun/Ted agree that it is the right way to fix it, then I can
patch it, restart the fuzzer and leave it running for a while.

Tejun Heo

unread,
Jun 20, 2016, 1:40:51 PM6/20/16
to Dmitry Vyukov, Jan Kara, Jens Axboe, Andrey Ryabinin, Alexander Viro, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Ilya Dryomov, Jan Kara, kerne...@fb.com, Ted Tso
Hello,

On Mon, Jun 20, 2016 at 03:38:41PM +0200, Dmitry Vyukov wrote:
> > Sorry for the late reply but now when thinking about the patch I don't
> > think it is quite right. Writeback can happen from other contexts than just
> > the worker one (e.g. kswapd can do writeback, or in some out-of-memory
> > situations we may punt to doing writeback directly instead of calling the
> > worker, or sync(2) calls fdatawrite() for block device inode directly when
> > iterating through blockdev superblock). So flushing the workqueue IMHO is
> > not covering 100% of cases.

Hmmm, yeah, the patch undoes what the cgroup writeback changes added
but it looks like the addition was exposing the existing problem more
rather than causing a new one.

> > I wanted to suggest to use inode_wait_for_writeback() which will make sure
> > writeback is done with the inode. However we effectively already do this
> > by calling bdev_write_inode() which calls writeback_single_inode() in
> > WB_SYNC_ALL mode. So by the time that call completes we are sure writeback
> > code is not looking at the inode. *However* what I think is happening is
> > that sync_blockdev() writes all the dirty pages, bdev_write_inode() writes
> > the inode and clears all dirty bits, however the inode still stays in the
> > b_dirty / b_io list of the wb because it has I_DIRTY_TIME set. Subsequently
> > once flusher work runs, it finds the inode, looks at it and boom. And the
> > problem seems to be that write_inode_now(inode, true) does not result in
> > I_DIRTY_TIME being cleared.
> >
> > Attached patch should fix this issue - it is compile-tested only. Dmitry,
> > can you check whether this patch fixes the issue for you as well?
>
> I can't directly test it because crash happened very infrequently.
> If Tejun/Ted agree that it is the right way to fix it, then I can
> patch it, restart the fuzzer and leave it running for a while.

Yes, please try out the patch.

Thanks a lot.

--
tejun

Dmitry Vyukov

unread,
Jun 21, 2016, 8:58:33 AM6/21/16
to Tejun Heo, Jan Kara, Jens Axboe, Andrey Ryabinin, Alexander Viro, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Ilya Dryomov, Jan Kara, kerne...@fb.com, Ted Tso
Done
At least it should catch any new introduced bugs.
Reply all
Reply to author
Forward
0 new messages