mm: GPF in bdi

Dmitry Vyukov

unread,

Feb 27, 2017, 12:11:32 PM2/27/17

to Al Viro, linux-...@vger.kernel.org, LKML, Jens Axboe, Andrew Morton, Tejun Heo, Jan Kara, Johannes Weiner, linu...@kvack.org, Andrey Ryabinin, syzkaller

Hello,

The following program triggers GPF in bdi_put:
https://gist.githubusercontent.com/dvyukov/15b3e211f937ff6abc558724369066ce/raw/cc017edf57963e30175a6a6fe2b8d917f6e92899/gistfile1.txt

general protection fault: 0000 [#1] SMP KASAN
Modules linked in:
CPU: 0 PID: 2952 Comm: a.out Not tainted 4.10.0+ #229
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff880063e72180 task.stack: ffff880064a78000
RIP: 0010:__read_once_size include/linux/compiler.h:247 [inline]
RIP: 0010:atomic_read arch/x86/include/asm/atomic.h:26 [inline]
RIP: 0010:refcount_sub_and_test include/linux/refcount.h:156 [inline]
RIP: 0010:refcount_dec_and_test include/linux/refcount.h:181 [inline]
RIP: 0010:kref_put include/linux/kref.h:71 [inline]
RIP: 0010:bdi_put+0x8b/0x1d0 mm/backing-dev.c:914
RSP: 0018:ffff880064a7f0b0 EFLAGS: 00010202
RAX: 0000000000000007 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff880064a7f118 RSI: 0000000000000001 RDI: 0000000000000000
RBP: ffff880064a7f140 R08: ffff880065603280 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000001 R12: dffffc0000000000
R13: 0000000000000038 R14: 1ffff1000c94fe17 R15: ffff880064a7f218
FS: 0000000000eb5880(0000) GS:ffff88006d000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020914ffa CR3: 000000006bc37000 CR4: 00000000001426f0
Call Trace:
bdev_evict_inode+0x203/0x3a0 fs/block_dev.c:888
evict+0x46e/0x980 fs/inode.c:553
iput_final fs/inode.c:1515 [inline]
iput+0x589/0xb20 fs/inode.c:1542
dentry_unlink_inode+0x43b/0x600 fs/dcache.c:343
__dentry_kill+0x34d/0x740 fs/dcache.c:538
dentry_kill fs/dcache.c:579 [inline]
dput.part.27+0x5ce/0x7c0 fs/dcache.c:791
dput fs/dcache.c:753 [inline]
do_one_tree+0x43/0x50 fs/dcache.c:1454
shrink_dcache_for_umount+0xbb/0x2b0 fs/dcache.c:1468
generic_shutdown_super+0xcd/0x4c0 fs/super.c:421
kill_anon_super+0x3c/0x50 fs/super.c:988
deactivate_locked_super+0x88/0xd0 fs/super.c:309
deactivate_super+0x155/0x1b0 fs/super.c:340
cleanup_mnt+0xb2/0x160 fs/namespace.c:1112
__cleanup_mnt+0x16/0x20 fs/namespace.c:1119
task_work_run+0x18a/0x260 kernel/task_work.c:116
tracehook_notify_resume include/linux/tracehook.h:191 [inline]
exit_to_usermode_loop+0x23b/0x2a0 arch/x86/entry/common.c:160
prepare_exit_to_usermode arch/x86/entry/common.c:190 [inline]
syscall_return_slowpath+0x4d3/0x570 arch/x86/entry/common.c:259
entry_SYSCALL_64_fastpath+0xc0/0xc2
RIP: 0033:0x435e19
RSP: 002b:00007ffc9d7f2748 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffea RBX: 0100000000000000 RCX: 0000000000435e19
RDX: 0000000020063000 RSI: 0000000020914ffa RDI: 0000000020037000
RBP: 00007ffc9d7f2fe0 R08: 0000000020039000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000402b70 R14: 0000000000402c00 R15: 0000000000000000
Code: 04 f2 f2 f2 c7 40 08 f3 f3 f3 f3 e8 f0 ec de ff 48 8d 45 98 48
8b 95 70 ff ff ff 48 c1 e8 03 42 c6 04 20 04 4c 89 e8 48 c1 e8 03 <42>
0f b6 0c 20 4c 89 e8 83 e0 07 83 c0 03 38 c8 7c 08 84 c9 0f
RIP: __read_once_size include/linux/compiler.h:247 [inline] RSP:
ffff880064a7f0b0
RIP: atomic_read arch/x86/include/asm/atomic.h:26 [inline] RSP: ffff880064a7f0b0
RIP: refcount_sub_and_test include/linux/refcount.h:156 [inline] RSP:
ffff880064a7f0b0
RIP: refcount_dec_and_test include/linux/refcount.h:181 [inline] RSP:
ffff880064a7f0b0
RIP: kref_put include/linux/kref.h:71 [inline] RSP: ffff880064a7f0b0
RIP: bdi_put+0x8b/0x1d0 mm/backing-dev.c:914 RSP: ffff880064a7f0b0
---[ end trace 8991b3d16ac9bf93 ]---

On commit e5d56efc97f8240d0b5d66c03949382b6d7e5570.

Dmitry Vyukov

unread,

Feb 27, 2017, 12:14:36 PM2/27/17

to Al Viro, linux-...@vger.kernel.org, LKML, Jens Axboe, Andrew Morton, Tejun Heo, Jan Kara, Johannes Weiner, linu...@kvack.org, Andrey Ryabinin, syzkaller

I also wee the following WARNING. Do you think it' the same underlying bug?

------------[ cut here ]------------
WARNING: CPU: 1 PID: 24265 at mm/backing-dev.c:899
bdi_exit+0x13e/0x160 mm/backing-dev.c:899
Kernel panic - not syncing: panic_on_warn set ...

CPU: 1 PID: 24265 Comm: syz-executor3 Not tainted 4.10.0-next-20170227+ #1
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:15 [inline]
dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
panic+0x1fb/0x412 kernel/panic.c:179
__warn+0x1c4/0x1e0 kernel/panic.c:540
warn_slowpath_null+0x2c/0x40 kernel/panic.c:583
bdi_exit+0x13e/0x160 mm/backing-dev.c:899
release_bdi+0x19/0x30 mm/backing-dev.c:908
kref_put include/linux/kref.h:72 [inline]
bdi_put+0x2a/0x40 mm/backing-dev.c:914

bdev_evict_inode+0x203/0x3a0 fs/block_dev.c:888
evict+0x46e/0x980 fs/inode.c:553
iput_final fs/inode.c:1515 [inline]
iput+0x589/0xb20 fs/inode.c:1542
dentry_unlink_inode+0x43b/0x600 fs/dcache.c:343
__dentry_kill+0x34d/0x740 fs/dcache.c:538
dentry_kill fs/dcache.c:579 [inline]
dput.part.27+0x5ce/0x7c0 fs/dcache.c:791
dput fs/dcache.c:753 [inline]
do_one_tree+0x43/0x50 fs/dcache.c:1454
shrink_dcache_for_umount+0xbb/0x2b0 fs/dcache.c:1468
generic_shutdown_super+0xcd/0x4c0 fs/super.c:421
kill_anon_super+0x3c/0x50 fs/super.c:988
deactivate_locked_super+0x88/0xd0 fs/super.c:309
deactivate_super+0x155/0x1b0 fs/super.c:340
cleanup_mnt+0xb2/0x160 fs/namespace.c:1112
__cleanup_mnt+0x16/0x20 fs/namespace.c:1119
task_work_run+0x18a/0x260 kernel/task_work.c:116
tracehook_notify_resume include/linux/tracehook.h:191 [inline]
exit_to_usermode_loop+0x23b/0x2a0 arch/x86/entry/common.c:160
prepare_exit_to_usermode arch/x86/entry/common.c:190 [inline]
syscall_return_slowpath+0x4d3/0x570 arch/x86/entry/common.c:259
entry_SYSCALL_64_fastpath+0xc0/0xc2

RIP: 0033:0x44fb79
RSP: 002b:00007fd57a8a0b58 EFLAGS: 00000212 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffea RBX: 0000000000708000 RCX: 000000000044fb79
RDX: 00000000208cf000 RSI: 0000000020058ffd RDI: 0000000020fc2000
RBP: 00000000000002f7 R08: 0000000020691000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000212 R12: 0000000020fc2000
R13: 0000000020058ffd R14: 00000000208cf000 R15: 0000000000000000

Al Viro

unread,

Feb 27, 2017, 1:28:12 PM2/27/17

to Dmitry Vyukov, linux-...@vger.kernel.org, LKML, Jens Axboe, Andrew Morton, Tejun Heo, Jan Kara, Johannes Weiner, linu...@kvack.org, Andrey Ryabinin, syzkaller

On Mon, Feb 27, 2017 at 06:11:11PM +0100, Dmitry Vyukov wrote:
> Hello,
>
> The following program triggers GPF in bdi_put:
> https://gist.githubusercontent.com/dvyukov/15b3e211f937ff6abc558724369066ce/raw/cc017edf57963e30175a6a6fe2b8d917f6e92899/gistfile1.txt

What happens is
* attempt of, essentially, mount -t bdev ..., calls mount_pseudo()
and then promptly destroys the new instance it has created.
* the only inode created on that sucker (root directory, that
is) gets evicted.
* most of ->evict_inode() is harmless, until it gets to
if (bdev->bd_bdi != &noop_backing_dev_info)
bdi_put(bdev->bd_bdi);

added there by "block: Make blk_get_backing_dev_info() safe without open bdev".
Since ->bd_bdi hadn't been initialized for that sucker (the same patch has
placed initialization into bdget()), we step into shit of varying nastiness,
depending on phase of moon, etc.

Could somebody explain WTF do we have those two lines in bdev_evict_inode(),
anyway? We set ->bd_bdi to something other than noop_backing_dev_info only
in __blkdev_get() when ->bd_openers goes from zero to positive, so why is
the matching bdi_put() not in __blkdev_put()? Jan?

Dmitry Vyukov

unread,

Feb 28, 2017, 12:56:16 PM2/28/17

to Al Viro, linux-...@vger.kernel.org, LKML, Jens Axboe, Andrew Morton, Tejun Heo, Jan Kara, Johannes Weiner, linu...@kvack.org, Andrey Ryabinin, syzkaller

I am also seeing the following crashes on
linux-next/8d01c069486aca75b8f6018a759215b0ed0c91f0. Do you think it's
the same underlying issue?

kasan: GPF could be caused by NULL-ptr deref or user memory access

general protection fault: 0000 [#1] SMP KASAN

Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 19552 Comm: syz-executor2 Not tainted 4.10.0-next-20170228+ #2

Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 01/01/2011

task: ffff8801c16ae400 task.stack: ffff880154c98000
RIP: 0010:__read_once_size include/linux/compiler.h:254 [inline]

RIP: 0010:atomic_read arch/x86/include/asm/atomic.h:26 [inline]

RIP: 0010:refcount_sub_and_test+0x82/0x1f0 lib/refcount.c:120
RSP: 0018:ffff880154c9f078 EFLAGS: 00010202
RAX: 0000000000000007 RBX: dffffc0000000000 RCX: ffffc90001a8f000
RDX: 0000000000000740 RSI: ffffffff8246160f RDI: 0000000000000001
RBP: ffff880154c9f110 R08: ffffe8ffffc29a28 R09: 0000000000000001
R10: 1ffff1002a993dcc R11: 0000000000000001 R12: 0000000000000038
R13: 0000000000000001 R14: ffff880154c9f0e8 R15: 1ffff1002a993e11
FS: 00007f0335223700(0000) GS:ffff8801dbe00000(0000) knlGS:0000000000000000

CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033

CR2: 0000000020fd3ff8 CR3: 00000001c4580000 CR4: 00000000001406f0
DR0: 0000000020000000 DR1: 0000000020001000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Call Trace:
refcount_dec_and_test+0x1a/0x20 lib/refcount.c:153
kref_put include/linux/kref.h:71 [inline]
bdi_put+0x19/0x40 mm/backing-dev.c:914

bdev_evict_inode+0x203/0x3a0 fs/block_dev.c:888
evict+0x46e/0x980 fs/inode.c:553
iput_final fs/inode.c:1515 [inline]
iput+0x589/0xb20 fs/inode.c:1542
dentry_unlink_inode+0x43b/0x600 fs/dcache.c:343
__dentry_kill+0x34d/0x740 fs/dcache.c:538
dentry_kill fs/dcache.c:579 [inline]
dput.part.27+0x5ce/0x7c0 fs/dcache.c:791
dput fs/dcache.c:753 [inline]
do_one_tree+0x43/0x50 fs/dcache.c:1454
shrink_dcache_for_umount+0xbb/0x2b0 fs/dcache.c:1468
generic_shutdown_super+0xcd/0x4c0 fs/super.c:421
kill_anon_super+0x3c/0x50 fs/super.c:988
deactivate_locked_super+0x88/0xd0 fs/super.c:309
deactivate_super+0x155/0x1b0 fs/super.c:340
cleanup_mnt+0xb2/0x160 fs/namespace.c:1112
__cleanup_mnt+0x16/0x20 fs/namespace.c:1119
task_work_run+0x18a/0x260 kernel/task_work.c:116
tracehook_notify_resume include/linux/tracehook.h:191 [inline]
exit_to_usermode_loop+0x23b/0x2a0 arch/x86/entry/common.c:160
prepare_exit_to_usermode arch/x86/entry/common.c:190 [inline]
syscall_return_slowpath+0x4d3/0x570 arch/x86/entry/common.c:259
entry_SYSCALL_64_fastpath+0xc0/0xc2

RIP: 0033:0x44fb79
RSP: 002b:00007f0335222b58 EFLAGS: 00000212 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffea RBX: 0000000000708150 RCX: 000000000044fb79
RDX: 000000002064e000 RSI: 00000000208f8ff8 RDI: 0000000020b28ff8
RBP: 00000000000002f7 R08: 0000000000000000 R09: 0000000000000000
R10: 8000000000000001 R11: 0000000000000212 R12: 0000000020b28ff8
R13: 00000000208f8ff8 R14: 000000002064e000 R15: 0000000000000000
Code: 00 f1 f1 f1 f1 c7 40 04 04 f2 f2 f2 c7 40 08 f3 f3 f3 f3 e8 71
02 2d ff 48 8d 45 98 48 c1 e8 03 c6 04 18 04 4c 89 e0 48 c1 e8 03 <0f>
b6 14 18 4c 89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85
RIP: __read_once_size include/linux/compiler.h:254 [inline] RSP:
ffff880154c9f078
RIP: atomic_read arch/x86/include/asm/atomic.h:26 [inline] RSP: ffff880154c9f078
RIP: refcount_sub_and_test+0x82/0x1f0 lib/refcount.c:120 RSP: ffff880154c9f078
---[ end trace 3457479bd0ed5045 ]---

Al Viro

unread,

Feb 28, 2017, 1:24:14 PM2/28/17

to Dmitry Vyukov, linux-...@vger.kernel.org, LKML, Jens Axboe, Andrew Morton, Tejun Heo, Jan Kara, Johannes Weiner, linu...@kvack.org, Andrey Ryabinin, syzkaller

On Tue, Feb 28, 2017 at 06:55:55PM +0100, Dmitry Vyukov wrote:

> I am also seeing the following crashes on
> linux-next/8d01c069486aca75b8f6018a759215b0ed0c91f0. Do you think it's
> the same underlying issue?

Yes.
1) Any attempt of mount -t bdev will fail, as it should
2) bdevfs instance created by that attempt will be immediately
destroyed (again, as it should)
3) the sole inode ever created for that instance (its root directory)
will be destroyed in process (again, as it should)
4) that inode has never had ->bd_bdi initialized - the value stored
there would have been whatever garbage kmem_cache_alloc() has left behind
5) bdev_evict_inode() will be called for that inode and if
aforementioned garbage happens to be not equal to &noop_backing_dev_info,
the pointer will be passed to bdi_put().

If that inode happens to reuse the memory previously occupied by a bdev
inode of a looked up but never opened block device, it will have ->bd_bdi
still equal to &noop_backing_dev_info, so that crap does not trigger
every time. That's what the junk (recvmsg/ioctl/etc.) in your reproducer
is affecting. Specific effects of bdi_put() will, of course, depend upon
the actual garbage found there - silent decrement of refcount of an existing
bdi setting the things up for later use-after-free, outright memory
corruption, etc.

_Any_ stack trace of form sys_mount() -> ... -> bdev_evict_inode() ->
bdi_put() -> <barf> is almost certainly the same bug.

I would still like to hear from Jan regarding the reasons why we do that
bdi_put() from bdev_evict_inode() and not in __blkdev_put(). My preference
would be to do it there (and reset ->bd_bdi to &noop_backing_dev_info) when
->bd_openers hits 0. And drop that code from bdev_evict_inode()...

Objections?

Jan Kara

unread,

Mar 1, 2017, 9:29:20 AM3/1/17

to Al Viro, Dmitry Vyukov, linux-...@vger.kernel.org, LKML, Jens Axboe, Andrew Morton, Tejun Heo, Jan Kara, Johannes Weiner, linu...@kvack.org, Andrey Ryabinin, syzkaller

On Mon 27-02-17 18:27:55, Al Viro wrote:
> On Mon, Feb 27, 2017 at 06:11:11PM +0100, Dmitry Vyukov wrote:
> > Hello,
> >
> > The following program triggers GPF in bdi_put:
> > https://gist.githubusercontent.com/dvyukov/15b3e211f937ff6abc558724369066ce/raw/cc017edf57963e30175a6a6fe2b8d917f6e92899/gistfile1.txt
>
> What happens is
> * attempt of, essentially, mount -t bdev ..., calls mount_pseudo()
> and then promptly destroys the new instance it has created.
> * the only inode created on that sucker (root directory, that
> is) gets evicted.
> * most of ->evict_inode() is harmless, until it gets to
> if (bdev->bd_bdi != &noop_backing_dev_info)
> bdi_put(bdev->bd_bdi);

Thanks for the analysis!

> added there by "block: Make blk_get_backing_dev_info() safe without open bdev".
> Since ->bd_bdi hadn't been initialized for that sucker (the same patch has
> placed initialization into bdget()), we step into shit of varying nastiness,
> depending on phase of moon, etc.

Yup, I've missed that the root inode of bdev superblock does not go through
bdget() (in fact I didn't think what happens with root inode for bdev
superblock at all) and thus bd_bdi is left uninitialized in that case. I'll
send a fix for that in a while.

> Could somebody explain WTF do we have those two lines in bdev_evict_inode(),
> anyway? We set ->bd_bdi to something other than noop_backing_dev_info only
> in __blkdev_get() when ->bd_openers goes from zero to positive, so why is
> the matching bdi_put() not in __blkdev_put()? Jan?

The problem is writeback code (from flusher work or through sync(2) -
generally inode_to_bdi() users) can be looking at bdev inode independently
from it being open. So if they start looking while the bdev is open but the
dereference happens after it is closed and device removed, we oops. We have
seen oopses due to this for quite a while. And all the stuff that is done
in __blkdev_put() is not enough to prevent writeback code from having a
look whether there is not something to write.

So what we do now is that once we establish valid bd_bdi reference, we
leave it alone until bdev inode gets evicted. And to handle the case when
underlying device actually changes, we unhash bdev inode when the device
gets removed from the system so that it cannot be found by bdget() anymore.

Honza
--
Jan Kara <ja...@suse.com>
SUSE Labs, CR

Jan Kara

unread,

Mar 1, 2017, 10:05:48 AM3/1/17

to Al Viro, Dmitry Vyukov, linux-...@vger.kernel.org, LKML, Jens Axboe, Andrew Morton, Tejun Heo, Jan Kara, Johannes Weiner, linu...@kvack.org, Andrey Ryabinin, syzkaller

Attached patch fixes the problem for me. I'll post it officially tomorrow
once Al has a chance to reply...

0001-block-Initialize-bd_bdi-on-inode-initialization.patch

Al Viro

unread,

Mar 2, 2017, 6:45:06 AM3/2/17

to Jan Kara, Dmitry Vyukov, linux-...@vger.kernel.org, LKML, Jens Axboe, Andrew Morton, Tejun Heo, Johannes Weiner, linu...@kvack.org, Andrey Ryabinin, syzkaller

On Wed, Mar 01, 2017 at 03:29:09PM +0100, Jan Kara wrote:

> The problem is writeback code (from flusher work or through sync(2) -
> generally inode_to_bdi() users) can be looking at bdev inode independently
> from it being open. So if they start looking while the bdev is open but the
> dereference happens after it is closed and device removed, we oops. We have
> seen oopses due to this for quite a while. And all the stuff that is done
> in __blkdev_put() is not enough to prevent writeback code from having a
> look whether there is not something to write.

Um. What's to prevent the queue/device/module itself from disappearing
from under you? IOW, what are you doing that is safe to do in face of
driver going rmmoded?

Jan Kara

unread,

Mar 2, 2017, 7:20:53 AM3/2/17

to Al Viro, Jan Kara, Dmitry Vyukov, linux-...@vger.kernel.org, LKML, Jens Axboe, Andrew Morton, Tejun Heo, Johannes Weiner, linu...@kvack.org, Andrey Ryabinin, syzkaller

So BDI does not have direct relation to the device itself. It is an
abstraction for some of the device properties / functionality and thus it
can live even after the device itself went away and the module got removed.
The only thing users of bdi want is to tell them whether the device is
congested or various statistics and dirty inode tracking for writeback
purposes and that is all independent of the particular device or whether it
still exists.

Technically there may be pointers bdi->dev, bdi->owner to the device which
are properly refcounted (so the device structure or module cannot be
removed under us). These references get dropped & cleared in
bdi_unregister() generally called from blk_cleanup_queue() (will be moved
to del_gendisk() soon) when the device is going away. This can happen while
e.g. bdev still references the bdi so users of bdi->dev or bdi->owner have
to be careful to sychronize against device removal and bdi_unregister() but
there are only very few such users.

Reply all

Reply to author

Forward

mm: GPF in bdi_put

Dmitry Vyukov

Dmitry Vyukov

Al Viro

Dmitry Vyukov

Al Viro

Jan Kara

Jan Kara

Al Viro

Jan Kara