| fs: GPF in locked_inode_to_wb_and_lock_list | Dmitry Vyukov | 18/04/16 02:44 AM | Hello,
I've hit the following GPF while running syzkaller fuzzer on commit
806fdcce017dc98c4dbf8ed001750a0d7d2bb0af (Apr 14):

general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
Modules linked in:
CPU: 3 PID: 32 Comm: kworker/u10:1 Not tainted 4.6.0-rc3+ #349
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Workqueue: writeback wb_workfn (flush-11:0)
task: ffff88006ccf1840 ti: ffff88006cda8000 task.ti: ffff88006cda8000
RIP: 0010:[<ffffffff818884d2>]  [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750
RSP: 0018:ffff88006cdaf7d0  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050
RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286
RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000
R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0
R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40
FS:  0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0
DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Stack:
 ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948
 ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948
 ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e
Call Trace:
 [< inline >] inode_to_wb_and_lock_list fs/fs-writeback.c:309
 [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554
 [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600
 [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709
 [< inline >] wb_do_writeback fs/fs-writeback.c:1844
 [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884
 [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094
 [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228
 [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
 [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392
Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e 00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42> 80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3
RIP  [< inline >] wb_get include/linux/backing-dev-defs.h:212
RIP  [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750 fs/fs-writeback.c:281
RSP <ffff88006cdaf7d0>
---[ end trace 986a4d314dcb2694 ]---

The crash happened here:

	if (wb != &wb->bdi->wb)
ffffffff818884cb:	48 89 d8		mov    %rbx,%rax
ffffffff818884ce:	48 c1 e8 03		shr    $0x3,%rax
ffffffff818884d2:	42 80 3c 28 00		cmpb   $0x0,(%rax,%r13,1)
ffffffff818884d7:	0f 85 17 06 00 00	jne    ffffffff81888af4 <locked_inode_to_wb_and_lock_list+0x6c4>
ffffffff818884dd:	48 8b 03		mov    (%rbx),%rax
ffffffff818884e0:	48 83 c0 50		add    $0x50,%rax
ffffffff818884e4:	48 39 c3		cmp    %rax,%rbx
ffffffff818884e7:	0f 84 c3 00 00 00	je     ffffffff818885b0 <locked_inode_to_wb_and_lock_list+0x180>

Which means that bdi is NULL (if I get the indirections right). The block
device is flush-11:0 (/dev/sr0).

I was able to reproduce it once by re-running programs from the crash log:
https://gist.githubusercontent.com/dvyukov/7c9e6358b661e2d50a774a70b5ee348a/raw/5cd94e32bbcf2a788b5efd0768cfd883c5921070/gistfile1.txt
But it took half an hour, and then another hour-long run did not trigger it.
So it is probably triggered by some rare race.

Thank you
| Re: fs: GPF in locked_inode_to_wb_and_lock_list | Tejun Heo | 20/04/16 02:14 PM | Hello, Dmitry.
... Man, that's a beautiful trace w/ decoding of inline functions. When did
we start doing that? Is there a specific config option for this?

So, it's the above instruction that faults. The wb != &wb->bdi->wb
comparison would be the cmp at 0xffffffff818884e4, and given that it just
compares against the address &bdi->wb, bdi being NULL wouldn't trigger the
fault there. The faulting instruction is

  cmpb $0x0,(%rax,%r13,1)
  -> *(u8 *)(%rax + %r13) == 0
  -> *(u8 *)((%rbx >> 3) + %r13) == 0

Where can that be from? I can't find anything matching even in the
surrounding functions. Hmmm... The base address %r13 is 0xdffffc0000000000,
which isn't a proper canonical address and is suspicious in general. Ooh,
it's KASAN_SHADOW_OFFSET. It looks like something is making KASAN trigger
a fault. Can we please bring in someone who's more familiar with KASAN?

Thanks.

-- 
tejun
| Re: fs: GPF in locked_inode_to_wb_and_lock_list | Dmitry Vyukov | 21/04/16 01:25 AM | Thanks! :)
I use this script for symbolization:
https://github.com/google/sanitizers/blob/master/address-sanitizer/tools/kasan_symbolize.py
It invokes addr2line to provide file:line info, adds inline frames, strips
? frames (are they ever useful?) and strips timestamps. You just need to
pipe a report through it. It assumes that vmlinux is in the current dir,
but you can override that with the --linux flag.

It can also insert the actual source lines into a report (the amount of
context is controlled with flags), for example:

Call Trace:
 [< inline >] khugepaged_scan_pmd include/linux/uaccess.h:9
     8	{
     9		current->pagefault_disabled++;
    10	}
 [< inline >] khugepaged_scan_mm_slot mm/huge_memory.c:2716
  2715			hend);
  2716		ret = khugepaged_scan_pmd(mm, vma,
  2717				khugepaged_scan.address,
 [< inline >] khugepaged_do_scan mm/huge_memory.c:2796
  2795		    pass_through_head < 2)
  2796			progress += khugepaged_scan_mm_slot(pages - progress,
  2797				&hpage);
 [<ffffffff81790b33>] khugepaged+0x993/0x48e0 mm/huge_memory.c:2831
  2830		while (!kthread_should_stop()) {
  2831			khugepaged_do_scan();
  2832			khugepaged_wait_work();
 [<ffffffff813c195f>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
  1302		add_wait_queue(k->waitq, &wait);
  1303		__set_current_state(TASK_INTERRUPTIBLE);
  1304	}
 [<ffffffff866d1b2f>] ret_from_fork+0x3f/0x70 arch/x86/entry/entry_64.S:405
   404		SWAPGS
   405		jmp	restore_regs_and_iret
   406	END(ret_from_fork)
| Re: fs: GPF in locked_inode_to_wb_and_lock_list | Dmitry Vyukov | 21/04/16 01:35 AM | On Wed, Apr 20, 2016 at 11:14 PM, Tejun Heo <t...@kernel.org> wrote:

I am here.

For every memory access to ADDR, KASAN first makes a byte load from
KASAN_SHADOW_OFFSET + (ADDR >> 3) (the shr $0x3 above). For accesses to
kernel memory the byte is addressable and contains the addressability
state of ADDR. This also has the positive side effect of catching any
access to user memory at the place of occurrence, as the address
computation produces a non-valid address. The negative side effect is
that instead of the usual NULL-deref GPF you get what we have here:
%r13 contains KASAN_SHADOW_OFFSET, %rax contains the shadow offset
(%rbx >> 3), and %rbx holds the pointer that the normal code is going to
dereference a few instructions later (the mov (%rbx),%rax). So without
KASAN the code would have triggered a plain NULL deref right there.

So whatever pointer the load for &wb->bdi->wb goes through is NULL (is it
wb itself that is NULL?).

Sorry for this mess. It is a known issue in KASAN, but we don't know how
to fix it without slowing down execution and sacrificing other properties.
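[For readers following along, the check KASAN emits before every access
looks roughly like this - a minimal, hypothetical C sketch, not the real
implementation in mm/kasan/; the constant is the %r13 value from the oops
above, and the function name is illustrative:]

#include <stdint.h>

#define KASAN_SHADOW_OFFSET 0xdffffc0000000000UL	/* %r13 in the oops */

/* One shadow byte describes the addressability of 8 bytes of kernel
 * memory, hence ">> 3" (the shr $0x3,%rax in the disassembly). */
static inline int shadow_byte_ok(uintptr_t addr)
{
	uint8_t *shadow = (uint8_t *)((addr >> 3) + KASAN_SHADOW_OFFSET);

	/* This load is the faulting cmpb $0x0,(%rax,%r13,1): for a bogus
	 * addr (e.g. one computed from a NULL pointer) the shadow address
	 * itself is unmapped, so the check faults before the original
	 * access ever executes. */
	return *shadow == 0;
}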
| Re: fs: GPF in locked_inode_to_wb_and_lock_list | Andrey Ryabinin | 21/04/16 02:10 AM | BTW, it would be nice to have that script in the kernel tree, instead of
scripts/decode_stacktrace.sh, which is far less advanced.
| Re: fs: GPF in locked_inode_to_wb_and_lock_list | Dmitry Vyukov | 21/04/16 02:29 AM | I am all for it!
We have two minor, unresolved bugs in the script:
https://github.com/google/sanitizers/issues?utf8=%E2%9C%93&q=is%3Aissue+is%3Aopen+kasan_symbolize
It would be good to resolve them before submitting it to the kernel tree,
but I am not particularly experienced in Python. +Andrey Konovalov will be
able to do it in autumn. If anybody is willing to take it earlier, you are
welcome (please drop a note to https://github.com/google/sanitizers/issues/674
to coordinate efforts).
| Re: fs: GPF in locked_inode_to_wb_and_lock_list | Andrey Ryabinin | 21/04/16 02:45 AM | 2016-04-21 11:35 GMT+03:00 Dmitry Vyukov <dvy...@google.com>:

Yes, it's a NULL wb, because there is only one load:

  mov (%rbx),%rax  =>  rax = wb->bdi
  add $0x50,%rax   =>  rax = &bdi->wb
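[Spelling that out as C - a toy model with a simplified layout; in the real
v4.6 structs the embedded root wb sits at offset 0x50 in backing_dev_info,
hence the add $0x50, and ->bdi appears to be the first field of
bdi_writeback per the mov from (%rbx); the layout here is illustrative:]

#include <stdbool.h>

struct backing_dev_info;

struct bdi_writeback {
	struct backing_dev_info *bdi;	/* the mov (%rbx),%rax loads this; it
					 * is the only memory access, so it is
					 * the only thing that can fault */
};

struct backing_dev_info {
	struct bdi_writeback wb;	/* embedded root wb */
};

static bool is_root_wb(struct bdi_writeback *wb)
{
	struct backing_dev_info *bdi = wb->bdi;	/* faults iff wb itself is NULL */

	return wb == &bdi->wb;	/* add $0x50,%rax + cmp: pure address arithmetic */
}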
| Re: fs: GPF in locked_inode_to_wb_and_lock_list | Dmitry Vyukov | 21/04/16 03:00 AM | I bet that wb becomes NULL on the second iteration of the loop. The
loop only iterates again in case of a race with another thread, which would
also explain why it is so difficult to reproduce. Tejun, does it make any
sense to you?
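[For context, the loop in question looks roughly like this - paraphrased
from fs/fs-writeback.c and include/linux/backing-dev-defs.h around v4.6,
not a verbatim copy. If i_wb is replaced while neither lock is held, the
retry path re-reads it, and a NULL i_wb on the second pass hands a NULL wb
to wb_get():]

/* include/linux/backing-dev-defs.h */
static inline void wb_get(struct bdi_writeback *wb)
{
	if (wb != &wb->bdi->wb)		/* <- the faulting line: dereferences wb */
		percpu_ref_get(&wb->refcnt);
}

/* fs/fs-writeback.c */
static struct bdi_writeback *
locked_inode_to_wb_and_lock_list(struct inode *inode)
	__releases(&inode->i_lock)
	__acquires(&wb->list_lock)
{
	while (true) {
		struct bdi_writeback *wb = inode_to_wb(inode);

		/* wb->list_lock nests outside i_lock, so pin wb, drop
		 * i_lock, then take list_lock */
		wb_get(wb);
		spin_unlock(&inode->i_lock);
		spin_lock(&wb->list_lock);
		wb_put(wb);		/* not going to deref it anymore */

		/* i_wb may have changed while no lock was held */
		if (likely(wb == inode->i_wb))
			return wb;	/* @inode already holds a ref */

		/* raced with a wb switch: retry with i_lock re-taken */
		spin_unlock(&wb->list_lock);
		cpu_relax();
		spin_lock(&inode->i_lock);
	}
}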
| Re: fs: GPF in locked_inode_to_wb_and_lock_list | Tejun Heo | 21/04/16 09:14 AM | Hello,

Heh, sometimes they are. They obviously are useful when backlinks are
broken for some reason. Through the remnants of old call stacks, they can
also sometimes tell the history of the task - where it was before it got
to the current backtrace. It isn't reliable at all, but I can recall
several occasions where they were useful in root-causing issues. It
*might* make sense to have an option to not strip them.

And, yeah, I'm all for it being in the kernel tree.

Thanks.

-- 
tejun
| Re: fs: GPF in locked_inode_to_wb_and_lock_list | Tejun Heo | 21/04/16 10:06 AM | Hello,

(cc'ing Ilya, Jan and Jens)

Yeah, that makes sense. I think the culprit is 43d1c0eb7e11 ("block:
detach bdev inode from its wb in __blkdev_put()"), which allows the inode
to wb association to be broken while other operations, including writeback,
are in progress. I thought it should be okay as the inode must be clean at
that point, but that obviously doesn't mean that there can be no writeback
operations in flight.

I hope we can eventually move away from the current model, where we try to
swap out an underlying data structure while upper layers may still be
referring to it; for now, we can make sure the writeback operation is
finished before detaching the wb.

Dmitry, I understand that the bug is difficult to reproduce, but can you
please give the following patch a try?

Thanks.

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 20a2c02..209ea33 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1530,12 +1530,7 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
 		kill_bdev(bdev);
 
 		bdev_write_inode(bdev);
-		/*
-		 * Detaching bdev inode from its wb in __destroy_inode()
-		 * is too late: the queue which embeds its bdi (along with
-		 * root wb) can be gone as soon as we put_disk() below.
-		 */
-		inode_detach_wb(bdev->bd_inode);
+		inode_detach_blkdev_wb(bdev);
 	}
 	if (bdev->bd_contains == bdev) {
 		if (disk->fops->release)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index d0b5ca5..ec1f530 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -230,6 +230,25 @@ static inline void inode_detach_wb(struct inode *inode)
 }
 
 /**
+ * inode_detach_blkdev_wb - disassociate a bd_inode from its wb
+ * @bdev: block_device of interest
+ *
+ * @bdev is being put for the last time.  Detaching bdev inode in
+ * __destroy_inode() is too late: the queue which embeds its bdi (along
+ * with root wb) can be gone as soon as the containing disk is put.
+ *
+ * This function dissociates @bdev->bd_inode from its wb.  The inode must
+ * be clean and no further operations should be started on it.
+ */
+static inline void inode_detach_blkdev_wb(struct block_device *bdev)
+{
+	if (bdev->bd_inode->i_wb) {
+		flush_delayed_work(&bdev->bd_inode->i_wb->dwork);
+		inode_detach_wb(bdev->bd_inode);
+	}
+}
+
+/**
  * wbc_attach_fdatawrite_inode - associate wbc and inode for fdatawrite
  * @wbc: writeback_control of interest
  * @inode: target inode
@@ -277,6 +296,10 @@ static inline void inode_detach_wb(struct inode *inode)
 {
 }
 
+static inline void inode_detach_blkdev_wb(struct block_device *bdev)
+{
+}
+
 static inline void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
 					       struct inode *inode)
 	__releases(&inode->i_lock)
| Re: fs: GPF in locked_inode_to_wb_and_lock_list | Dmitry Vyukov | 22/04/16 11:55 AM | I've merged it into my tree and will restart the fuzzer and leave it
running for the weekend. Though, yeah, it is difficult to reproduce... |
| Re: fs: GPF in locked_inode_to_wb_and_lock_list | Dmitry Vyukov | 06/06/16 10:46 AM | Hi Tejun,
I have not seen the crash since then. Please mail the patch.
| [PATCH] block: flush writeback dwork before detaching a bdev inode from it | Tejun Heo | 17/06/16 09:04 AM | 43d1c0eb7e11 ("block: detach bdev inode from its wb in __blkdev_put()") detached bdev inode from its wb as the bdev inode may outlive the underlying bdi and thus the wb. This is accomplished by invoking inode_detach_wb() from __blkdev_put(); however, while the inode can't be dirtied by the time control reaches there, that doesn't guarantee that writeback isn't already in progress on the inode. This can lead to the inode being disassociated from its wb while a writeback operation is in flight, causing oopses like the following.
  Modules linked in:
  [...]
  ---[ end trace 986a4d314dcb2694 ]---

Fix it by flushing the wb dwork before detaching the inode. Combined with
the fact that the inode can no longer be dirtied, this guarantees that no
writeback operation can be in flight or initiated.

As this involves details on the writeback side which don't have much to do
with block_dev, encapsulate it in a helper function -
inode_detach_blkdev_wb().

Signed-off-by: Tejun Heo <t...@kernel.org>
Reported-and-tested-by: Dmitry Vyukov <dvy...@google.com>
Link: http://lkml.kernel.org/g/CACT4Y+YAjq8mcfiVxR075didJKCyOCVrqxdbfKdgUxabstbfmA@mail.gmail.com
Fixes: 43d1c0eb7e11 ("block: detach bdev inode from its wb in __blkdev_put()")
Cc: Ilya Dryomov <idry...@gmail.com>
Cc: sta...@vger.kernel.org # v4.2+
---
 fs/block_dev.c            |  7 +------
 include/linux/writeback.h | 23 +++++++++++++++++++++++
 2 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 71ccab1..6a14100 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1612,12 +1612,7 @@ static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
| Re: [PATCH] block: flush writeback dwork before detaching a bdev inode from it | Jan Kara | 20/06/16 06:31 AM | Hi,

<snip>

Sorry for the late reply, but now that I've thought about the patch I don't
think it is quite right. Writeback can happen from other contexts than just
the worker one (e.g. kswapd can do writeback, in some out-of-memory
situations we may punt to doing writeback directly instead of calling the
worker, and sync(2) calls fdatawrite() for the block device inode directly
when iterating through the blockdev superblock). So flushing the writeback
work IMHO does not cover 100% of the cases.

I wanted to suggest using inode_wait_for_writeback(), which will make sure
writeback is done with the inode. However, we effectively already do this
by calling bdev_write_inode(), which calls writeback_single_inode() in
WB_SYNC_ALL mode, so by the time that call completes we are sure the
writeback code is not looking at the inode.

*However*, what I think is happening is that sync_blockdev() writes all the
dirty pages and bdev_write_inode() writes the inode and clears all the
dirty bits, but the inode still stays in the b_dirty / b_io list of the wb
because it has I_DIRTY_TIME set. Subsequently, once the flusher work runs,
it finds the inode, looks at it, and boom. And the root problem seems to be
that write_inode_now(inode, true) does not result in I_DIRTY_TIME being
cleared.

The attached patch should fix this issue - it is compile-tested only.
Dmitry, can you check whether this patch fixes the issue for you as well?

Honza
-- 
Jan Kara <ja...@suse.com>
SUSE Labs, CR
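[The attachment isn't reproduced in this archive. The idea would be
something along the following lines, inside __writeback_single_inode() in
fs/fs-writeback.c of that era - a hedged sketch of the approach described
above, not necessarily Jan's actual patch:]

	/* Sketch: when doing data-integrity writeback (WB_SYNC_ALL, which is
	 * what write_inode_now(inode, true) ends up requesting), treat a
	 * pending I_DIRTY_TIME as real dirtiness so the timestamps get
	 * written and the flag cleared, letting the inode leave the wb's
	 * b_dirty/b_io lists for good. */
	dirty = inode->i_state & I_DIRTY;
	if ((inode->i_state & I_DIRTY_TIME) &&
	    (wbc->sync_mode == WB_SYNC_ALL ||
	     unlikely(inode->i_state & I_DIRTY_TIME_EXPIRED))) {
		dirty |= I_DIRTY_TIME | I_DIRTY_TIME_EXPIRED;
		trace_writeback_lazytime(inode);
	}
	inode->i_state &= ~dirty;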
| Re: [PATCH] block: flush writeback dwork before detaching a bdev inode from it | Dmitry Vyukov | 20/06/16 06:39 AM | I can't directly test it because the crash happened very infrequently.
If Tejun/Ted agree that it is the right way to fix it, then I can apply the
patch, restart the fuzzer and leave it running for a while.
| Re: [PATCH] block: flush writeback dwork before detaching a bdev inode from it | Tejun Heo | 20/06/16 10:40 AM | Hello,

Hmmm, yeah, the patch undoes what the cgroup writeback changes added, but
it looks like the addition was exposing the existing problem more rather
than causing a new one. Yes, please try out the patch.

Thanks a lot.

-- 
tejun
| Re: [PATCH] block: flush writeback dwork before detaching a bdev inode from it | Dmitry Vyukov | 21/06/16 05:58 AM | Done.
At least it should catch any newly introduced bugs.