fs: NULL deref in atime_needs_update

146 views
Skip to first unread message

Dmitry Vyukov

unread,
Feb 5, 2016, 4:11:42 PM2/5/16
to Alexander Viro, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin
Hello,

I've hit the following GPF while running syzkaller fuzzer:

general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
Modules linked in:
CPU: 1 PID: 5178 Comm: syz-executor Not tainted 4.5.0-rc2+ #65
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff880064768000 ti: ffff8800622c0000 task.ti: ffff8800622c0000
RIP: 0010:[<ffffffff8181aa5d>] [<ffffffff8181aa5d>]
atime_needs_update+0x2d/0x460
RSP: 0018:ffff8800622c7a30 EFLAGS: 00010203
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: dffffc0000000000
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000000000000000c
RBP: ffff8800622c7a58 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff8800622c7c08
R13: ffff8800622c7c08 R14: ffff8800301ca322 R15: ffff8800622c7bb0
FS: 00007fd1c9f8b700(0000) GS:ffff88003ed00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000020f31000 CR3: 0000000062274000 CR4: 00000000000006e0
Stack:
ffff8800622c7bf4 0000000000000000 ffff8800622c7c08 ffff8800301ca322
ffff8800622c7bb0 ffff8800622c7b38 ffffffff817ecd91 ffff880030bf5200
ffff8800622c7bb8 1ffff1000c458f56 ffff8800622c7c00 ffff8800622c7be0
Call Trace:
[< inline >] get_link fs/namei.c:1006
[<ffffffff817ecd91>] link_path_walk+0xaf1/0x1030 fs/namei.c:1968
[<ffffffff817ed311>] path_parentat+0x41/0x150 fs/namei.c:2176
[<ffffffff817f4c5c>] filename_parentat+0x17c/0x3c0 fs/namei.c:2198
[< inline >] user_path_parent fs/namei.c:2412
[< inline >] SYSC_renameat2 fs/namei.c:4411
[< inline >] SyS_renameat2 fs/namei.c:4375
[< inline >] SYSC_renameat fs/namei.c:4521
[<ffffffff817f9a72>] SyS_renameat+0x192/0x820 fs/namei.c:4518
[<ffffffff8669e0b6>] entry_SYSCALL_64_fastpath+0x16/0x7a
arch/x86/entry/entry_64.S:185
Code: 89 e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 89 f3 e8 08 25 d5
ff 48 8d 7b 0c 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f>
b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85
RIP [<ffffffff8181aa5d>] atime_needs_update+0x2d/0x460 fs/inode.c:1611
RSP <ffff8800622c7a30>
---[ end trace 1a4c9bda4680ce46 ]---

On commit df48ab3c2f5ffca88b7803ffbadd074bd5a0a2ef.

Objdump shows that inode is NULL in atime_needs_update.

Unfortunately reproduction of this crash is very hard. The program
executes something along the lines of:

mmap(0x20000000, 15945728, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x20000000
mkdir("./bus", 0662515705056234013740) = 0
openat(AT_FDCWD, "./bus", O_RDONLY|O_EXCL) = 3
symlinkat("../bus", 3, "./bus") = 0
renameat(3, "./bus", 3, "./bus/file0") = 0
mmap(0x20f35000, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x20f35000
mount("./bus", "./bus", 0x20f2aee4,
MS_RDONLY|MS_NODEV|MS_RELATIME|MS_NODIRATIME|MS_BIND|MS_MOVE|MS_REC|MS_UNBINDABLE|MS_SLAVE|MS_SHARED|0xc000380,
0x20093f5f) = 0
open("./bus/file0", O_RDWR|O_EXCL) = -1 EISDIR (Is a directory)
exit_group(0) = ?

But in multiple threads so that some calls can be doubled and/or
overlapped. And all this happens on a tmpfs mount.

But I was able to reproduce it 8 or so times, so I am sure that it is real.

For future reference, I was running these programs:
https://gist.githubusercontent.com/dvyukov/124c457d308fa724d88a/raw/fec2d86e125a7fd2fa2916791d65d7daead7cbbb/gistfile1.txt
Following these instructions:
https://github.com/google/syzkaller/wiki/How-to-execute-syzkaller-programs

Mickaël Salaün

unread,
Feb 16, 2016, 6:40:30 PM2/16/16
to Dmitry Vyukov, Alexander Viro, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin
Hi,

Actually I found the same bug (without fuzzing) and I can reproduce it in a deterministic way (e.g. by creating a LSM that return 1 for the security_file_open hook). At least, from v4.2.8 I can easily trigger traces like this :

BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
IP: [<ffffffff81170871>] atime_needs_update+0x11/0xc0
PGD 127b17067 PUD 12ab2e067 PMD 0
Oops: 0000 [#45] SMP
[...]
RIP: 0010:[<ffffffff81170871>] [<ffffffff81170871>] atime_needs_update+0x11/0xc0
RSP: 0018:ffff880127853c18 EFLAGS: 00010246
RAX: ffff88012ad0c080 RBX: ffff88012ad0c1d8 RCX: ffff88012ad0c080
RDX: 0000000000000000 RSI: ffff88012ad0c1d8 RDI: ffff880127853d98
RBP: ffff880127853c28 R08: ffff8800cc0a2540 R09: ffff8800cfbfc320
R10: ffff8800cc0a2540 R11: 0000000000000001 R12: ffff8800cb5d6300
R13: 0000000000000000 R14: ffff88012ad0c080 R15: ffff880127853e7c
FS: 00007f1054aae700(0000) GS:ffff88012fc40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000050 CR3: 0000000127977000 CR4: 00000000000406e0
Stack:
ffff88012ad0c1d8 ffff8800cb5d6300 ffff880127853c60 ffffffff8117094e
ffff8800c9ade3c0 0000000000000000 00000000a670294f ffff880127853d70
ffff880127853d98 ffff880127853c98 ffffffff8116071c ffff8800cb4ada80
Call Trace:
[<ffffffff8117094e>] ? touch_atime+0x2e/0xd0
[<ffffffff8116071c>] ? trailing_symlink+0xec/0x280
[<ffffffff81163a78>] ? path_openat+0x468/0x1240
[<ffffffff8111856d>] ? pagevec_lru_move_fn+0xed/0x110
[<ffffffff81117ff0>] ? __activate_page+0x130/0x130
[<ffffffff8116593c>] ? do_filp_open+0x8c/0x100
[<ffffffff81164dec>] ? filename_lookup+0xec/0x180
[<ffffffff8115bc24>] ? do_open_execat+0x74/0x170
[<ffffffff8115d437>] ? do_execveat_common.isra.42+0x1a7/0x6a0
[<ffffffff8115db90>] ? SyS_execve+0x30/0x40
[<ffffffff8156ad65>] ? stub_execve+0x5/0x5
[<ffffffff8156aadb>] ? entry_SYSCALL_64_fastpath+0x16/0x6a
Code: 89 c7 e8 63 eb ff ff 48 89 d8 5b c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 54 53 f6 46 0c 02 75 72 48 8b 56 28 <48> 8b 42 50 a9 01 04 00 00 75 63 f6 c4 08 75 65 4c 8b 27 41 8b
RIP [<ffffffff81170871>] atime_needs_update+0x11/0xc0
RSP <ffff880127853c18>
CR2: 0000000000000050
---[ end trace 97dc4f4bb0214bd8 ]---


Regards,
Mickaël
signature.asc

Dmitry Vyukov

unread,
Feb 19, 2016, 2:32:30 PM2/19/16
to Mickaël Salaün, Alexander Viro, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin
I've hit another GPF in atime_needs_update, but this time from SyS_openat:

kasan: GPF could be caused by NULL-ptr deref or user memory
accessgeneral protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
Modules linked in:
CPU: 0 PID: 20147 Comm: syz-executor Not tainted 4.5.0-rc4+ #329
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff88005f154740 ti: ffff88005f048000 task.ti: ffff88005f048000
RIP: 0010:[<ffffffff81818b5d>] [<ffffffff81818b5d>]
atime_needs_update+0x2d/0x460
RSP: 0018:ffff88005f04fa48 EFLAGS: 00010203
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffff88005f04fd88
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000000000000000c
RBP: ffff88005f04fa70 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff88005f04fd98
R13: 0000000000000000 R14: ffff88005f04fd98 R15: ffff88005f04fd78
FS: 00007f612639b700(0000) GS:ffff88003ec00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000002003ef84 CR3: 000000006073e000 CR4: 00000000000006f0
Stack:
ffff88005f04fd40 ffff88005f04fe08 0000000000000000 ffff88005f04fd98
ffff88005f04fd78 ffff88005f04fab8 ffffffff817e5572 ffff88005f04fd78
ffff88002bcf02f8 0000000000000001 0000000000000000 ffff88002bcf02f8
Call Trace:
[< inline >] get_link fs/namei.c:1006
[<ffffffff817e5572>] trailing_symlink+0x142/0x760 fs/namei.c:2094
[<ffffffff817ec531>] path_openat+0xbc1/0x5e30 fs/namei.c:3389
[<ffffffff817f4fde>] do_filp_open+0x18e/0x250 fs/namei.c:3421
[<ffffffff817b970c>] do_sys_open+0x1fc/0x420 fs/open.c:1022
[< inline >] SYSC_openat fs/open.c:1049
[<ffffffff817b99a0>] SyS_openat+0x30/0x40 fs/open.c:1043
[<ffffffff86662636>] entry_SYSCALL_64_fastpath+0x16/0x7a
arch/x86/entry/entry_64.S:185
Code: 89 e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 89 f3 e8 c8 32 d5
ff 48 8d 7b 0c 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f>
b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85
RIP [<ffffffff81818b5d>] atime_needs_update+0x2d/0x460 fs/inode.c:1611
RSP <ffff88005f04fa48>
---[ end trace 0790795e3dea8fc8 ]---


The program that triggered it was:

mmap(&(0x7f0000000000)=nil, (0x51000), 0x3, 0x32, 0xffffffffffffffff, 0x0)
mkdir(&(0x7f0000002000)="2e2f66696c653000", 0x40)
mount(&(0x7f000001a000)="2e2f66696c653000",
&(0x7f000001a000+0x4cf)="2e2f66696c653000",
&(0x7f000001a000+0xc79)="72616d667300", 0x800,
&(0x7f000003f000-0x7c)="3dd496054f9a5d4176272c354e968b1cf2c6c18792461e7325b5774e9197aad240b4b45ac67faa72d988a86dd8ba348739259df192d01f08ca39a8524fee967d7f39f0ec53ce000ff78ed4b2510c5ae8812ed421db038390d8f3fea242e682c907334bc6fc74f2490e4aac983bdf85c9f5b6f6b288b0074ab201ab")
r1 = open$dir(&(0x7f000004c000+0x7df)="2e2f66696c653000", 0x200000, 0x82)
symlinkat(&(0x7f000002e000)="2e2f66696c65302f66696c653000", r1,
&(0x7f0000033000)="2e2f66696c653000")
openat(r1, &(0x7f000004b000-0x8)="2e2f66696c653000", 0x40000, 0x2)


On commit 1926e54f115725a9248d0c4c65c22acaf94de4c4.

Al Viro

unread,
Feb 19, 2016, 10:21:31 PM2/19/16
to Dmitry Vyukov, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin
On Fri, Feb 19, 2016 at 08:32:10PM +0100, Dmitry Vyukov wrote:
> > BUG: unable to handle kernel NULL pointer dereference at 0000000000000050

NULL inode->i_sb, by the look of the offset, but I really don't understand
where the hell is that code doing (or how is that instruction going to
generate dereferencing of 0x50, for that matter).
What was the fault address here and what config are you using for those
builds?

Al Viro

unread,
Feb 19, 2016, 10:54:45 PM2/19/16
to Dmitry Vyukov, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin
On Sat, Feb 20, 2016 at 03:21:27AM +0000, Al Viro wrote:
> On Fri, Feb 19, 2016 at 08:32:10PM +0100, Dmitry Vyukov wrote:
> > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
>
> NULL inode->i_sb, by the look of the offset, but I really don't understand
> where the hell is that code doing (or how is that instruction going to
> generate dereferencing of 0x50, for that matter).

BTW, Mickaël's trace *does* make sense and it's definitely NULL inode->i_sb
(inode itself - in %rsi, inode->i_sb - in %rdx, offset of s_flags is 0x50,
the line in question is
if ((inode->i_sb->s_flags & MS_NODIRATIME) && S_ISDIR(inode->i_mode))

What I don't understand is what could possibly have NULL ->i_sb in *any*
instance of struct inode.

Dmitry Vyukov

unread,
Feb 20, 2016, 5:36:55 AM2/20/16
to Al Viro, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin
This is with CONFIG_KASAN. GPF happens on shadow check for NULL
address. In both cases:
RAX: dffffc0000000000
RDX: 0000000000000001
where dffffc0000000000 is shadow base and 0000000000000001 is offset.
So the original memory access was to range [8-16) (i.e. NULL with
small offset).

Mickaël Salaün

unread,
Feb 20, 2016, 8:25:48 AM2/20/16
to Al Viro, Dmitry Vyukov, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin
I can also trigger bugs with a bad inode pointer dereference in atime_needs_update: if (inode->i_flags & S_NOATIME)

I think the bug may be somewhere in the nd->depth handling (when its value is 0) in fs/namei.c:get_link(): struct saved *last = nd->stack + nd->depth - 1

Moreover, is it intentional that touch_atime() is called by get_link() even if the access (previously checked with security_file_open(), e.g. by do_last) is denied?

signature.asc

Al Viro

unread,
Feb 20, 2016, 12:10:48 PM2/20/16
to Mickaël Salaün, Dmitry Vyukov, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin
On Sat, Feb 20, 2016 at 02:25:40PM +0100, Mickaël Salaün wrote:

> I think the bug may be somewhere in the nd->depth handling (when its value is 0) in fs/namei.c:get_link(): struct saved *last = nd->stack + nd->depth - 1

Getting there with nd->depth == 0 would certainly be a bug - it would mean
that we got there without should_follow_link() having returned 1.

In case of open() it would be "do_last() has returned positive without
should_follow_link() having returned 1".

<looks>

OK, there are several places where we rely on not getting bogus return values
- inode_permission() should not return positives, neither should vfs_open(),
security_path_truncate() and notify_change().

Other similar "handle the last component" functions are guaranteed to
never return positives other than directly from should_follow_link(), so
they are OK.

IIRC, you used LSM to inject a positive value to inode_permission(), right?

Another way to trigger that would've been ->open() returning positive -
a bug on *anything* since ->open() had been introduced in 0.95. Amount of
harm would vary - e.g. 0.95 would simply have that positive number returned
to userland, looking like successful open(2). With no new descriptor, of
course...

Short-term we probably want just
if (unlikely(error > 0)) {
WARN_ON(1);
error = -EINVAL;
}
added right after out: in do_last(), try to trigger Dmitry's reproducers
on it and then work back to the source of that thing *if* that's what's
happening in his case. Yours almost certainly is just that.

Longer-term... I'm not sure. Having a method that is supposed to return 0
or -E<something> actually return positive is going to be a bad thing, no
matter what, but "that bogus value gets passed to userland" is a lot
more tolerable than "kernel memory corruption". do_last() calling conventions
make it vulnerable to the latter, and as far as nd->stack underruns that's
it, but I'm not sure we don't have other places where such bug in driver,
etc. would translate into mess ;-/

OK, in any case, let's start with checking if Dmitry is seeing that and not
something else. I still don't understand his stack traces - the fault
address quoted in his first posting doesn't match the register values in
the same trace, and there's also a possibility that it's an RCU-related
crap. This should give a warning and prevent an oops if we are hitting
a stack underrun on bogus positive from do_last(). Dmitry, could you try
to build with delta below and run your reproducer(s)?

diff --git a/fs/namei.c b/fs/namei.c
index f624d13..e30deef 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3273,6 +3273,10 @@ opened:
goto exit_fput;
}
out:
+ if (unlikely(error > 0)) {
+ WARN_ON(1);
+ error = -EINVAL;
+ }
if (got_write)
mnt_drop_write(nd->path.mnt);
path_put(&save_parent);

Mickaël Salaün

unread,
Feb 20, 2016, 3:26:42 PM2/20/16
to Al Viro, Dmitry Vyukov, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin

On 20/02/2016 18:10, Al Viro wrote:
> On Sat, Feb 20, 2016 at 02:25:40PM +0100, Mickaël Salaün wrote:
>
>> I think the bug may be somewhere in the nd->depth handling (when its value is 0) in fs/namei.c:get_link(): struct saved *last = nd->stack + nd->depth - 1
>
> Getting there with nd->depth == 0 would certainly be a bug - it would mean
> that we got there without should_follow_link() having returned 1.
>
> In case of open() it would be "do_last() has returned positive without
> should_follow_link() having returned 1".

OK, the do_last() return value was the origin of my bug in fs/namei.c:path_openat(): while (!(error = link_path_walk(s, nd)) && (error = do_last(nd, file, op, &opened)) > 0)

>
> <looks>
>
> OK, there are several places where we rely on not getting bogus return values
> - inode_permission() should not return positives, neither should vfs_open(),
> security_path_truncate() and notify_change().
>
> Other similar "handle the last component" functions are guaranteed to
> never return positives other than directly from should_follow_link(), so
> they are OK.
>
> IIRC, you used LSM to inject a positive value to inode_permission(), right?

Yes, my test hook was wrong because it returned 1 (instead of -EPERM or -EACCES) which is an invalid return value.

> diff --git a/fs/namei.c b/fs/namei.c
> index f624d13..e30deef 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3273,6 +3273,10 @@ opened:
> goto exit_fput;
> }
> out:
> + if (unlikely(error > 0)) {
> + WARN_ON(1);
> + error = -EINVAL;
> + }
> if (got_write)
> mnt_drop_write(nd->path.mnt);
> path_put(&save_parent);
>

I think your warning patch should be upstreamed to detect such cases :)

Thanks,
Mickaël

signature.asc

Al Viro

unread,
Feb 20, 2016, 3:50:48 PM2/20/16
to Mickaël Salaün, Dmitry Vyukov, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin
On Sat, Feb 20, 2016 at 09:26:28PM +0100, Mickaël Salaün wrote:
> > + if (unlikely(error > 0)) {
> > + WARN_ON(1);
> > + error = -EINVAL;
> > + }
> > if (got_write)
> > mnt_drop_write(nd->path.mnt);
> > path_put(&save_parent);
> >
>
> I think your warning patch should be upstreamed to detect such cases :)

I'm not sure whether it's better done there or one step closer to the
source - e.g. telling whether it was bogus ->open() or a bogus LSM hook
(and _which_ bogus LSM hook had it been) would be hard with that location
of test.

It still leaves the question of what's going on in Dmitry's tests - it might
be one of those source or it might be something else entirely; this location
of test would at least tell whether a stack underrun is involved or not...

Dmitry Vyukov

unread,
Feb 22, 2016, 6:20:50 AM2/22/16
to Al Viro, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin
I've reproduced the second report (the one originating in openat) with
this patch and the WARNING did _not_ fire:


kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
Modules linked in:
CPU: 2 PID: 17525 Comm: syz-executor Not tainted 4.5.0-rc5+ #331
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff88002c6ddf00 ti: ffff88002c740000 task.ti: ffff88002c740000
RIP: 0010:[<ffffffff81821ded>] [<ffffffff81821ded>]
atime_needs_update+0x2d/0x460
RSP: 0018:ffff88002c747a48 EFLAGS: 00010203
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffff88002c747d88
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000000000000000c
RBP: ffff88002c747a70 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff88002c747d98
R13: 0000000000000000 R14: ffff88002c747d98 R15: ffff88002c747d78
FS: 00007f24da3d9700(0000) GS:ffff88006d600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000002003f000 CR3: 000000002e2d0000 CR4: 00000000000006e0
Stack:
ffff88002c747d40 ffff88002c747e08 0000000000000000 ffff88002c747d98
ffff88002c747d78 ffff88002c747ab8 ffffffff817eeda2 ffff88002c747d78
ffff880030fa88e8 ffff88002c747c98 0000000000000000 ffff88002c747d40
Call Trace:
[< inline >] get_link fs/namei.c:1006
[<ffffffff817eeda2>] trailing_symlink+0x142/0x760 fs/namei.c:2094
[<ffffffff817f5cec>] path_openat+0xb4c/0x5760 fs/namei.c:3393
[<ffffffff817fe13e>] do_filp_open+0x18e/0x250 fs/namei.c:3425
[<ffffffff817c2dbc>] do_sys_open+0x1fc/0x420 fs/open.c:1022
[< inline >] SYSC_openat fs/open.c:1049
[<ffffffff817c3050>] SyS_openat+0x30/0x40 fs/open.c:1043
[<ffffffff8669f6b6>] entry_SYSCALL_64_fastpath+0x16/0x7a
arch/x86/entry/entry_64.S:185
Code: 89 e5 41 57 41 56 41 55 41 54 49 89 fc 53 48 89 f3 e8 98 17 d5
ff 48 8d 7b 0c 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f>
b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85
RIP [<ffffffff81821ded>] atime_needs_update+0x2d/0x460 fs/inode.c:1617
RSP <ffff88002c747a48>
---[ end trace 872348222bfe81b0 ]---


This is _not_ on a tmpfs mount.

Regarding the registers, here is disassembly. The crash happens on a
KASAN check of the

ffffffff81821dc0 <atime_needs_update>:
ffffffff81821dc0: 55 push %rbp
ffffffff81821dc1: 48 89 e5 mov %rsp,%rbp
ffffffff81821dc4: 41 57 push %r15
ffffffff81821dc6: 41 56 push %r14
ffffffff81821dc8: 41 55 push %r13
ffffffff81821dca: 41 54 push %r12
ffffffff81821dcc: 49 89 fc mov %rdi,%r12
ffffffff81821dcf: 53 push %rbx
ffffffff81821dd0: 48 89 f3 mov %rsi,%rbx
ffffffff81821dd3: e8 98 17 d5 ff callq
ffffffff81573570 <__sanitizer_cov_trace_pc>
ffffffff81821dd8: 48 8d 7b 0c lea 0xc(%rbx),%rdi
ffffffff81821ddc: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax
ffffffff81821de3: fc ff df
ffffffff81821de6: 48 89 fa mov %rdi,%rdx
ffffffff81821de9: 48 c1 ea 03 shr $0x3,%rdx
ffffffff81821ded: 0f b6 14 02 movzbl (%rdx,%rax,1),%edx
ffffffff81821df1: 48 89 f8 mov %rdi,%rax
ffffffff81821df4: 83 e0 07 and $0x7,%eax
ffffffff81821df7: 83 c0 03 add $0x3,%eax
ffffffff81821dfa: 38 d0 cmp %dl,%al
ffffffff81821dfc: 7c 08 jl
ffffffff81821e06 <atime_needs_update+0x46>
ffffffff81821dfe: 84 d2 test %dl,%dl
ffffffff81821e00: 0f 85 03 03 00 00 jne
ffffffff81822109 <atime_needs_update+0x349>
ffffffff81821e06: f6 43 0c 02 testb $0x2,0xc(%rbx)
ffffffff81821e0a: 0f 85 1a 02 00 00 jne
ffffffff8182202a <atime_needs_update+0x26a>
ffffffff81821e10: e8 5b 17 d5 ff callq
ffffffff81573570 <__sanitizer_cov_trace_pc>
ffffffff81821e15: 48 8d 7b 28 lea 0x28(%rbx),%rdi
ffffffff81821e19: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax
ffffffff81821e20: fc ff df
ffffffff81821e23: 48 89 fa mov %rdi,%rdx
ffffffff81821e26: 48 c1 ea 03 shr $0x3,%rdx
ffffffff81821e2a: 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1)
ffffffff81821e2e: 0f 85 c4 03 00 00 jne
ffffffff818221f8 <atime_needs_update+0x438>

It means that inode is NULL here:

bool atime_needs_update(const struct path *path, struct inode *inode)
{
struct vfsmount *mnt = path->mnt;
struct timespec now;

if (inode->i_flags & S_NOATIME)
return false;

Al Viro

unread,
Feb 22, 2016, 12:23:18 PM2/22/16
to Dmitry Vyukov, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin
On Mon, Feb 22, 2016 at 12:20:30PM +0100, Dmitry Vyukov wrote:

> I've reproduced the second report (the one originating in openat) with
> this patch and the WARNING did _not_ fire:

Lovely... Could you dump your inode.o on anonftp somewhere? Or post
the disassembled atime_needs_update, for that matter - shouldn't be
all that long... .config (and gcc version) you are using would also be nice.

On the face of it, NULL inode is a plausible source of that one, but
it's _very_ odd. It would have to be NULL ->link_inode, and since the
warning hadn't triggered, there was a successful should_follow_link(),
with NULL inode argument.

Could you slap WARN_ON(!inode) in pick_link()? Or even
WARN_ON(IS_ERR_OR_NULL(inode)), for that matter...

Dmitry Vyukov

unread,
Feb 23, 2016, 10:35:20 AM2/23/16
to Al Viro, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin
I was able to reproduce the crash on vanilla kernel (no KASAN, no
KCOV) with the new WARNING, this one does fire.

So I am on commit 4de8ebeff8ddefaceeb7fc6a9b1a514fc9624509 (Feb 22), with:

$ git diff
diff --git a/fs/namei.c b/fs/namei.c
index f624d13..9675e9e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1669,6 +1669,7 @@ static int pick_link(struct nameidata *nd,
struct path *link,
{
int error;
struct saved *last;
+ WARN_ON(IS_ERR_OR_NULL(inode));
if (unlikely(nd->total_link_count++ >= MAXSYMLINKS)) {
path_to_nameidata(link, nd);
return -ELOOP;
@@ -3273,6 +3274,10 @@ opened:
goto exit_fput;
}
out:
+ if (unlikely(error > 0)) {
+ WARN_ON(1);
+ error = -EINVAL;
+ }
if (got_write)
mnt_drop_write(nd->path.mnt);
path_put(&save_parent);


The crash:

[ 8095.048336] ------------[ cut here ]------------
[ 8095.048864] WARNING: CPU: 3 PID: 5532 at fs/namei.c:1672
should_follow_link.part.25+0x55/0x21a()
[ 8095.049830] Modules linked in:
[ 8095.050155] CPU: 3 PID: 5532 Comm: syz-executor Not tainted 4.5.0-rc5+ #69
[ 8095.050760] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[ 8095.051104] 0000000000000000 ffff8800005ffc78 ffffffff8194e5f9
0000000000000000
[ 8095.051649] ffffffff8334c24c ffff8800005ffcb0 ffffffff81172291
ffff8800005ffde0
[ 8095.051649] ffff8800005ffd98 0000000000048000 ffff8800005ffde0
ffff8800005ffefc
[ 8095.051649] Call Trace:
[ 8095.051649] [<ffffffff8194e5f9>] dump_stack+0x99/0xd0
[ 8095.054784] [<ffffffff81172291>] warn_slowpath_common+0x81/0xc0
[ 8095.054784] [<ffffffff81172385>] warn_slowpath_null+0x15/0x20
[ 8095.054784] [<ffffffff8185a55c>] should_follow_link.part.25+0x55/0x21a
[ 8095.054784] [<ffffffff813180c9>] path_openat+0x1229/0x1500
[ 8095.057172] [<ffffffff813193d9>] do_filp_open+0x79/0xd0
[ 8095.057172] [<ffffffff82b170c2>] ? _raw_spin_unlock+0x22/0x30
[ 8095.057172] [<ffffffff81328d98>] ? __alloc_fd+0xf8/0x200
[ 8095.058364] [<ffffffff81306eb0>] do_sys_open+0x110/0x1f0
[ 8095.058364] [<ffffffff81306fbf>] SyS_openat+0xf/0x20
[ 8095.058364] [<ffffffff82b17b76>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 8095.060124] ---[ end trace 50e32daa426e4c92 ]---
[ 8095.060526] BUG: unable to handle kernel NULL pointer dereference
at 000000000000000c
[ 8095.061111] IP: [<ffffffff813272b9>] atime_needs_update+0x9/0xc0
[ 8095.061549] PGD 6ae6e067 PUD 6ae6a067 PMD 0
[ 8095.061549] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 8095.061549] Modules linked in:
[ 8095.061549] CPU: 3 PID: 5532 Comm: syz-executor Tainted: G W
4.5.0-rc5+ #69
[ 8095.061549] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[ 8095.061549] task: ffff88002bc00000 ti: ffff8800005fc000 task.ti:
ffff8800005fc000
[ 8095.061549] RIP: 0010:[<ffffffff813272b9>] [<ffffffff813272b9>]
atime_needs_update+0x9/0xc0
[ 8095.061549] RSP: 0018:ffff8800005ffcb0 EFLAGS: 00010282
[ 8095.061549] RAX: 0000000000000030 RBX: ffff8800005ffde0 RCX: 0000000000000000
[ 8095.061549] RDX: ffff8800005ffe38 RSI: 0000000000000000 RDI: ffff8800005ffe38
[ 8095.061549] RBP: ffff8800005ffcc0 R08: 0000000000000002 R09: 0000000000000001
[ 8095.061549] R10: 0000000000000001 R11: 0000000000001828 R12: 0000000000000000
[ 8095.061549] R13: ffff8800005ffe38 R14: ffff880031950be0 R15: ffff8800005ffefc
[ 8095.061549] FS: 00007fc3bf2ee700(0000) GS:ffff88007f900000(0000)
knlGS:0000000000000000
[ 8095.069746] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 8095.069746] CR2: 000000000000000c CR3: 000000006b982000 CR4: 00000000000006e0
[ 8095.071108] Stack:
[ 8095.071538] ffff8800005ffcc0 ffff8800005ffde0 ffff8800005ffd00
ffffffff81314ca2
[ 8095.072227] 0000000200000002 0000000000000005 ffff8800005ffd98
0000000000048000
[ 8095.072227] ffff8800005ffde0 ffff8800005ffefc ffff8800005ffdd0
ffffffff81317173
[ 8095.072227] Call Trace:
[ 8095.072227] [<ffffffff81314ca2>] trailing_symlink+0x62/0x260
[ 8095.072227] [<ffffffff81317173>] path_openat+0x2d3/0x1500
[ 8095.072227] [<ffffffff813193d9>] do_filp_open+0x79/0xd0
[ 8095.072227] [<ffffffff82b170c2>] ? _raw_spin_unlock+0x22/0x30
[ 8095.072227] [<ffffffff81328d98>] ? __alloc_fd+0xf8/0x200
[ 8095.072227] [<ffffffff81306eb0>] do_sys_open+0x110/0x1f0
[ 8095.072227] [<ffffffff81306fbf>] SyS_openat+0xf/0x20
[ 8095.072227] [<ffffffff82b17b76>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 8095.072227] Code: ff ff ff 48 85 c0 48 89 c3 74 08 48 89 c7 e8 ef
dc ff ff 48 89 d8 5b 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 53
48 83 ec 08 <f6> 46 0c 02 48 8b 1f 75 6b 48 8b 7e 28 48 8b 47 50 a9 01
04 00
[ 8095.072227] RIP [<ffffffff813272b9>] atime_needs_update+0x9/0xc0
[ 8095.072227] RSP <ffff8800005ffcb0>
[ 8095.072227] CR2: 000000000000000c
[ 8095.116838] ---[ end trace 50e32daa426e4c93 ]---
[ 8095.116838] BUG: sleeping function called from invalid context at
include/linux/sched.h:2795
[ 8095.116838] in_atomic(): 1, irqs_disabled(): 1, pid: 5532, name: syz-executor
[ 8095.116838] INFO: lockdep is turned off.
[ 8095.116838] irq event stamp: 636
[ 8095.116838] hardirqs last enabled at (635): [<ffffffff811d9926>]
vprintk_emit+0x2d6/0x5f0
[ 8095.116838] hardirqs last disabled at (636): [<ffffffff82b1a089>]
error_entry+0x69/0xc0
[ 8095.116838] softirqs last enabled at (632): [<ffffffff81178172>]
__do_softirq+0x222/0x4a0
[ 8095.116838] softirqs last disabled at (623): [<ffffffff81178767>]
irq_exit+0xa7/0xc0
[ 8095.116838] CPU: 3 PID: 5532 Comm: syz-executor Tainted: G D W
4.5.0-rc5+ #69
[ 8095.116838] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[ 8095.116838] 0000000000000000 ffff8800005ff9a8 ffffffff8194e5f9
ffff88002bc00000
[ 8095.116838] 000000000000159c ffff8800005ff9d0 ffffffff811a0659
ffffffff832a29c3
[ 8095.116838] 0000000000000aeb 0000000000000000 ffff8800005ff9f8
ffffffff811a0764
[ 8095.116838] Call Trace:
[ 8095.116838] [<ffffffff8194e5f9>] dump_stack+0x99/0xd0
[ 8095.116838] [<ffffffff811a0659>] ___might_sleep+0x179/0x240
[ 8095.116838] [<ffffffff811a0764>] __might_sleep+0x44/0x80
[ 8095.116838] [<ffffffff811842cf>] exit_signals+0x1f/0x130
[ 8095.116838] [<ffffffff811756ef>] do_exit+0xbf/0xd10
[ 8095.116838] [<ffffffff811da524>] ? kmsg_dump+0x104/0x180
[ 8095.116838] [<ffffffff8108549f>] oops_end+0x9f/0xe0
[ 8095.116838] [<ffffffff810ce6e8>] no_context+0x108/0x390
[ 8095.116838] [<ffffffff811d7837>] ? print_time.part.13+0x67/0x90
[ 8095.116838] [<ffffffff810cea8d>] __bad_area_nosemaphore+0x11d/0x220
[ 8095.116838] [<ffffffff810ceb9e>] bad_area_nosemaphore+0xe/0x10
[ 8095.116838] [<ffffffff810cf2c4>] __do_page_fault+0x84/0x470
[ 8095.116838] [<ffffffff810cf764>] trace_do_page_fault+0x74/0x2c0
[ 8095.116838] [<ffffffff810c9ba4>] do_async_page_fault+0x14/0x90
[ 8095.116838] [<ffffffff82b19eb8>] async_page_fault+0x28/0x30
[ 8095.116838] [<ffffffff813272b9>] ? atime_needs_update+0x9/0xc0
[ 8095.116838] [<ffffffff81314ca2>] trailing_symlink+0x62/0x260
[ 8095.116838] [<ffffffff81317173>] path_openat+0x2d3/0x1500
[ 8095.116838] [<ffffffff813193d9>] do_filp_open+0x79/0xd0
[ 8095.116838] [<ffffffff82b170c2>] ? _raw_spin_unlock+0x22/0x30
[ 8095.116838] [<ffffffff81328d98>] ? __alloc_fd+0xf8/0x200
[ 8095.116838] [<ffffffff81306eb0>] do_sys_open+0x110/0x1f0
[ 8095.116838] [<ffffffff81306fbf>] SyS_openat+0xf/0x20
[ 8095.116838] [<ffffffff82b17b76>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 8095.195133] note: syz-executor[5532] exited with preempt_count 1


And here is my inode.o:
https://gist.githubusercontent.com/dvyukov/27ec88c2c1a83c2e0f38/raw/2514d0ddd7720a978e6a2f67c2dcb391046ce0e7/gistfile1.txt

This can be reproduced following the instructions here:
https://github.com/google/syzkaller/wiki/How-to-execute-syzkaller-programs
Using this command line:
# ./syz-execprog -cover=0 -procs=60 -repeat=0 prog
with the following program:
https://gist.githubusercontent.com/dvyukov/fc026f36f9f76d1a440b/raw/0e133afa99eb7de45880523fbd48256cd2ae4a6c/gistfile1.txt
(requires CONFIG_USER_NS=y). The crash triggers after hours of execution.

Al Viro

unread,
Feb 23, 2016, 1:17:56 PM2/23/16
to Dmitry Vyukov, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin
On Tue, Feb 23, 2016 at 04:34:59PM +0100, Dmitry Vyukov wrote:
> The crash:
>
> [ 8095.048336] ------------[ cut here ]------------
> [ 8095.048864] WARNING: CPU: 3 PID: 5532 at fs/namei.c:1672
> should_follow_link.part.25+0x55/0x21a()

NULL or ERR_PTR() passed as inode to should_follow_link().

> [ 8095.060526] BUG: unable to handle kernel NULL pointer dereference
> at 000000000000000c

OK, NULL inode it is. And that was in do_last().

> And here is my inode.o:
> https://gist.githubusercontent.com/dvyukov/27ec88c2c1a83c2e0f38/raw/2514d0ddd7720a978e6a2f67c2dcb391046ce0e7/gistfile1.txt
>
> This can be reproduced following the instructions here:
> https://github.com/google/syzkaller/wiki/How-to-execute-syzkaller-programs
> Using this command line:
> # ./syz-execprog -cover=0 -procs=60 -repeat=0 prog
> with the following program:
> https://gist.githubusercontent.com/dvyukov/fc026f36f9f76d1a440b/raw/0e133afa99eb7de45880523fbd48256cd2ae4a6c/gistfile1.txt
> (requires CONFIG_USER_NS=y). The crash triggers after hours of execution.

Joy... Another interesting question is whether we'd been in RCU mode at
the time of that should_follow_link(). The thing is, we could've come there
either from
if (!(open_flag & O_CREAT)) {
if (nd->last.name[nd->last.len])
nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
/* we _can_ be in RCU mode here */
error = lookup_fast(nd, &path, &inode, &seq);
if (likely(!error))
goto finish_lookup;

or from
BUG_ON(nd->flags & LOOKUP_RCU);
inode = d_backing_inode(path.dentry);
seq = 0; /* out of RCU mode, so the value doesn't matter */
if (unlikely(d_is_negative(path.dentry))) {
path_to_nameidata(&path, nd);
return -ENOENT;
}
finish_lookup:

In the latter case we are holding a reference to path.dentry, so d_is_negative
ought to be reliable and refering to the same backing inode. In the former,
if we leave still in RCU mode, we went through
*inode = d_backing_inode(dentry);
negative = d_is_negative(dentry);
[check that dentry->d_seq is still unchanged]
...
[check that negative is false]
and that guarantees that both inode and negative had been taken while dentry
remained stable, so we couldn't pass through the second check with NULL
inode. And returning 0 in non-RCU mode means that we go through
if (unlikely(d_is_negative(dentry))) {
dput(dentry);
return -ENOENT;
}
path->mnt = mnt;
path->dentry = dentry;
err = follow_managed(path, nd);
if (likely(!err))
*inode = d_backing_inode(path->dentry);
return err;
with dentry pinned, so NULL inode here is also bloody odd - we have positive
dentry that will remain positive through all that and somehow follow_managed()
(in non-RCU mode) gets us a negative one. Now, follow_managed() either
leaves path->dentry unchanged (and keeps it pinned through all of that), or
does
struct vfsmount *mounted = lookup_mnt(path);
if (mounted) {
dput(path->dentry);
if (need_mntput)
mntput(path->mnt);
path->mnt = mounted;
path->dentry = dget(mounted->mnt_root);
(and ->mnt_root should never be negative), or goes into follow_autmount(),
where we either leave the damn thing unchanged or hit
path->dentry = dget(mnt->mnt_root);
... or we have ->d_automount() instance doing something nasty to it. Damn.

OK, a look through the instances shows that only autofs4 one might modify
path->dentry:
struct dentry *new = d_lookup(parent, &dentry->d_name);
if (!new)
return NULL;
ino = autofs4_dentry_ino(new);
ino->last_used = jiffies;
dput(path->dentry);
path->dentry = new;
in autofs4_mountpoint_changed()... I doubt that this is the cause here,
but let's slap WARN_ON(d_is_negative(new)) there.

The thing is, I *do* see one bug around should_follow_link(), but it would
manifest differently. So you must be hitting something else there, to get
that NULL inode... Could you try to reproduce it with the patch below
and see what warnings trigger? I'll try to reproduce it as well, but since
you already have a working setup...

diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index c6d7d3d..86f81e3 100644
--- a/fs/autofs4/root.c
+++ b/fs/autofs4/root.c
@@ -323,6 +323,7 @@ static struct dentry *autofs4_mountpoint_changed(struct path *path)
struct dentry *new = d_lookup(parent, &dentry->d_name);
if (!new)
return NULL;
+ WARN_ON(d_is_negative(new));
ino = autofs4_dentry_ino(new);
ino->last_used = jiffies;
dput(path->dentry);
diff --git a/fs/namei.c b/fs/namei.c
index f624d13..ac00bcb 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1209,6 +1209,7 @@ static int follow_managed(struct path *path, struct nameidata *nd)
/* Handle an automount point */
if (managed & DCACHE_NEED_AUTOMOUNT) {
ret = follow_automount(path, nd, &need_mntput);
+ WARN_ON(d_is_negative(path->dentry));
if (ret < 0)
break;
continue;
@@ -1613,8 +1614,10 @@ unlazy:
path->mnt = mnt;
path->dentry = dentry;
err = follow_managed(path, nd);
- if (likely(!err))
+ if (likely(!err)) {
*inode = d_backing_inode(path->dentry);
+ WARN_ON(!inode);
+ }
return err;

need_lookup:
@@ -1712,6 +1715,17 @@ static inline int should_follow_link(struct nameidata *nd, struct path *link,
return 0;
if (!follow)
return 0;
+ /* make sure that d_is_symlink above matches inode */
+ if (nd->flags & LOOKUP_RCU) {
+ if (read_seqcount_retry(&link->dentry->d_seq, seq)) {
+ WARN_ON(1); // just as way to report hitting
+ // that path; it's OK to get
+ // here, in the final variant
+ // WARN_ON will disappear.
+ return -ECHILD;
+ }
+ }
+ WARN_ON(!inode); // now, _that_ should not happen.
return pick_link(nd, link, inode, seq);
}

@@ -3273,6 +3287,10 @@ opened:

Ian Kent

unread,
Feb 23, 2016, 10:12:23 PM2/23/16
to Mickaël Salaün, Dmitry Vyukov, Alexander Viro, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin
On Wed, 2016-02-17 at 00:40 +0100, Mickaël Salaün wrote:
> Hi,
>
> Actually I found the same bug (without fuzzing) and I can reproduce it
> in a deterministic way (e.g. by creating a LSM that return 1 for the
> security_file_open hook). At least, from v4.2.8 I can easily trigger
> traces like this :

Reading through this thread I wonder if this is a new problem.

Is there a previous kernel it can't be reproduced on?
Perhaps a bisect will shed some light on what's happening.

Al Viro

unread,
Feb 23, 2016, 11:46:32 PM2/23/16
to Ian Kent, Mickaël Salaün, Dmitry Vyukov, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin
On Wed, Feb 24, 2016 at 11:12:13AM +0800, Ian Kent wrote:
> On Wed, 2016-02-17 at 00:40 +0100, Mickaël Salaün wrote:
> > Hi,
> >
> > Actually I found the same bug (without fuzzing) and I can reproduce it
> > in a deterministic way (e.g. by creating a LSM that return 1 for the
> > security_file_open hook). At least, from v4.2.8 I can easily trigger
> > traces like this :
>
> Reading through this thread I wonder if this is a new problem.
>
> Is there a previous kernel it can't be reproduced on?
> Perhaps a bisect will shed some light on what's happening.

There are several things in the mix. What Mickaël has found is that a bunch
of places where _positive_ number returned instead of expected 0 or -E... can
get propagated all way back to caller of do_last(), confusing the hell out
of it. That's not what Dmitry has triggered, though. WARN_ON() in the
end of do_last() would've triggered, and IMO this one, along with mitigation
(map that "error value" to -EINVAL) should go into mainline and all -stable.
Bogus ->open() returning a positive number had always been bad news; in the
best case it would be returned to userland, leading to "open(2) has failed
and returned a positive number". Hell knows if we ever had such instances
(or have them right now), but I wouldn't bet on their absense. Rare
failure exits returning bogus values in an ->open() instance in some driver
can easily stay unnoticed for a long time. Starting from at least 3.6
(circa the atomic_open support) it got more unpleasant than simple "confuse
the hell out of userland". ->open() isn't the only vector for injection of
such crap - ->permission() would also serve, same for several LSM turds, etc.

Again, that's a separate problem. What Dmitry seems to be catching is getting
crap values fed to should_follow_link() as inode. I see one bug in that
area that does need fixing (fix present in the last patch I've posted, with
WARN_ON() to indicate that this thing has triggered; _that_ WARN_ON() should
be gone from the final variant, since this can trigger without driver bugs,
etc.) In RCU mode after we'd checked that dentry is a symlink one, we need
to verify that it hadn't been changed since before we'd fetched the inode.
It might have been e.g. a regular file, which got unlinked with symlink
created in its place. Then we'd go into get_link() with non-symlink inode
and oops on attempt to call its ->i_op->get_link(). That race is real, very
hard to hit (you need both the unlink(2) and symlink(2) to happen between
lookup_fast() and should_follow_link() and unless you have PREEMPT_RCU you
can't lose the timeslice there) and would've manifested differently.

But that leaves other two kinds of bugs getting triggered: inode of some
non-symlink is possible, but what we saw included NULL inode when we'd
reached finish_open: in do_last(). Should be flat-out impossible - we either
get lookup_fast(..., &inode, ...) return 0 and store NULL in inode, or
get NULL inode from pinned d_is_symlink() dentry after having grabbed
a reference and left RCU mode. Neither should be possible without either
something weird happening to lookup_fast() (and we would've seen oopsen in
link_path_walk() if that could happen, BTW) or screwed dentry refcounting
somewhere, combined with a race that managed to turn...

Oh, shit. No screwed refcounting is needed. Look:
BUG_ON(nd->flags & LOOKUP_RCU);
inode = d_backing_inode(path.dentry);
seq = 0; /* out of RCU mode, so the value doesn't matter */
if (unlikely(d_is_negative(path.dentry))) {
path_to_nameidata(&path, nd);
return -ENOENT;
}
Suppose we come here with negative path.dentry. We are holding a reference,
all right, and for a _postive_ dentry that would've been enough to keep
it positive. Not so for a negative one, though - symlink(2) on another
CPU doint d_instantiate() just before the d_is_negative() check and we
are fucked - inode is stale and we sail through all the checks, all the
way into should_follow_link().

We also have the same kind of crap in walk_component() -
err = lookup_slow(nd, &path);
if (err < 0)
return err;
inode = d_backing_inode(path.dentry);
seq = 0; /* we are already out of RCU mode */
err = -ENOENT;
if (d_is_negative(path.dentry))
goto out_path_put;
There it's much harder to hit, though - we need it not just d_instantiate()
overlapping those lines; we need the racing syscall to get from locking
the parent to d_instantiate() between the point where lookup_slow() has
unlocked the parent and d_is_negative(). And lookup_slow() couldn't have
gone into mountpoint crossing, so it's really hard to hit - you pretty
much have to get preempted just after fetching inode.

OK, the next delta to try, and there definitely are several commits in
that pile. It still does not explain the traces with GPF at 0x50, though -
for that we need not just a NULL getting to should_follow_link() but
something non-NULL with NULL at offset 40 from it (offset of ->i_sb in
struct inode). That something *can't* be a valid struct inode or had been
one in recent past - ->i_sb is assigned in new_inode(), value is always
non-NULL and never modified all the way until RCU-delayed freeing of struct
inode. So that has to be something entirely different... Anyway, the
patch so far follows:

diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index c6d7d3d..86f81e3 100644
--- a/fs/autofs4/root.c
+++ b/fs/autofs4/root.c
@@ -323,6 +323,7 @@ static struct dentry *autofs4_mountpoint_changed(struct path *path)
struct dentry *new = d_lookup(parent, &dentry->d_name);
if (!new)
return NULL;
+ WARN_ON(d_is_negative(new));
ino = autofs4_dentry_ino(new);
ino->last_used = jiffies;
dput(path->dentry);
diff --git a/fs/namei.c b/fs/namei.c
index f624d13..a5bcf63 100644
@@ -1743,11 +1757,11 @@ static int walk_component(struct nameidata *nd, int flags)
if (err < 0)
return err;

- inode = d_backing_inode(path.dentry);
seq = 0; /* we are already out of RCU mode */
err = -ENOENT;
if (d_is_negative(path.dentry))
goto out_path_put;
+ inode = d_backing_inode(path.dentry);
}

if (flags & WALK_PUT)
@@ -3192,12 +3206,12 @@ retry_lookup:
return error;

BUG_ON(nd->flags & LOOKUP_RCU);
- inode = d_backing_inode(path.dentry);
seq = 0; /* out of RCU mode, so the value doesn't matter */
if (unlikely(d_is_negative(path.dentry))) {
path_to_nameidata(&path, nd);
return -ENOENT;
}
+ inode = d_backing_inode(path.dentry);
finish_lookup:
if (nd->depth)
put_link(nd);
@@ -3206,11 +3220,6 @@ finish_lookup:
if (unlikely(error))
return error;

- if (unlikely(d_is_symlink(path.dentry)) && !(open_flag & O_PATH)) {
- path_to_nameidata(&path, nd);
- return -ELOOP;
- }
-
if ((nd->flags & LOOKUP_RCU) || nd->path.mnt != path.mnt) {
path_to_nameidata(&path, nd);
} else {
@@ -3229,6 +3238,10 @@ finish_open:
return error;
}
audit_inode(nd->name, nd->path.dentry, 0);
+ if (unlikely(d_is_symlink(nd->path.dentry)) && !(open_flag & O_PATH)) {
+ error = -ELOOP;
+ goto out;
+ }
error = -EISDIR;
if ((open_flag & O_CREAT) && d_is_dir(nd->path.dentry))
goto out;
@@ -3273,6 +3286,10 @@ opened:

Dmitry Vyukov

unread,
Feb 24, 2016, 5:03:28 AM2/24/16
to Al Viro, Ian Kent, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin
Restarted testing with this patch (dropped previous patches because
they conflict).

These "unlikely" scenarios can be more likely inside of VMs where
effective preemption can happen at random points. Also NMIs probably
can increase probability of such races.

Dmitry Vyukov

unread,
Feb 24, 2016, 5:15:40 AM2/24/16
to Al Viro, Ian Kent, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin
For now I can only say that I am hitting this one (3 times in 20 mins):

if (read_seqcount_retry(&link->dentry->d_seq, seq)) {
WARN_ON(1); // just as way to report hitting
// that path; it's OK to get
// here, in the final variant
// WARN_ON will disappear.

Dmitry Vyukov

unread,
Feb 24, 2016, 8:35:39 AM2/24/16
to Al Viro, Ian Kent, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin
Fired after some time:

[ 3491.147607] ------------[ cut here ]------------
[ 3491.147986] WARNING: CPU: 1 PID: 17000 at fs/namei.c:1728
path_openat+0x14e5/0x1560()
[ 3491.148576] Modules linked in:
[ 3491.148818] CPU: 1 PID: 17000 Comm: syz-executor Tainted: G
W 4.5.0-rc5+ #72
[ 3491.149530] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[ 3491.149530] 0000000000000000 ffff88002d6ebcb8 ffffffff8194ad79
0000000000000000
[ 3491.152103] ffffffff8334501c ffff88002d6ebcf0 ffffffff81172291
0000000000000000
[ 3491.152103] ffff88002d6ebd98 ffff88002d6ebde0 0000000000048000
ffff88002d6ebefc
[ 3491.155310] Call Trace:
[ 3491.155310] [<ffffffff8194ad79>] dump_stack+0x99/0xd0
[ 3491.155310] [<ffffffff81172291>] warn_slowpath_common+0x81/0xc0
[ 3491.157483] [<ffffffff81172385>] warn_slowpath_null+0x15/0x20
[ 3491.157483] [<ffffffff81318495>] path_openat+0x14e5/0x1560
[ 3491.157483] [<ffffffff81319549>] do_filp_open+0x79/0xd0
[ 3491.157483] [<ffffffff82b11d92>] ? _raw_spin_unlock+0x22/0x30
[ 3491.157483] [<ffffffff81328f08>] ? __alloc_fd+0xf8/0x200
[ 3491.157483] [<ffffffff81306eb0>] do_sys_open+0x110/0x1f0
[ 3491.157483] [<ffffffff81306fbf>] SyS_openat+0xf/0x20
[ 3491.157483] [<ffffffff82b12836>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 3491.161521] ---[ end trace c65af9be7536efae ]---
[ 3491.161838] BUG: unable to handle kernel NULL pointer dereference
at 000000000000000c
[ 3491.162339] IP: [<ffffffff81327429>] atime_needs_update+0x9/0xc0
[ 3491.162577] PGD 2d626067 PUD 2c4cc067 PMD 0
[ 3491.163036] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 3491.163039] Modules linked in:
[ 3491.163039] CPU: 1 PID: 17000 Comm: syz-executor Tainted: G
W 4.5.0-rc5+ #72
[ 3491.163039] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[ 3491.164386] task: ffff88002c9196c0 ti: ffff88002d6e8000 task.ti:
ffff88002d6e8000
[ 3491.164386] RIP: 0010:[<ffffffff81327429>] [<ffffffff81327429>]
atime_needs_update+0x9/0xc0
[ 3491.164386] RSP: 0018:ffff88002d6ebcb0 EFLAGS: 00010282
[ 3491.165591] RAX: 0000000000000030 RBX: ffff88002d6ebde0 RCX: 0000000000000002
[ 3491.165591] RDX: ffff88002d6ebe38 RSI: 0000000000000000 RDI: ffff88002d6ebe38
[ 3491.165591] RBP: ffff88002d6ebcc0 R08: ffff88002d6ebe38 R09: 0000000000000001
[ 3491.167202] R10: 0000000000000001 R11: 0000000000001816 R12: 0000000000000000
[ 3491.167368] R13: ffff88002d6ebe38 R14: ffff880072508980 R15: ffff88002d6ebefc
[ 3491.167368] FS: 00007f02454e5700(0000) GS:ffff88003ed00000(0000)
knlGS:0000000000000000
[ 3491.167368] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3491.167368] CR2: 000000000000000c CR3: 000000002cb5f000 CR4: 00000000000006e0
[ 3491.167368] Stack:
[ 3491.169740] 0000000000000000 ffff88002d6ebde0 ffff88002d6ebd00
ffffffff81314922
[ 3491.169740] 0000000000000001 0000000000000000 ffff88002d6ebd98
ffff88002d6ebde0
[ 3491.169740] 0000000000048000 ffff88002d6ebefc ffff88002d6ebdd0
ffffffff8131729d
[ 3491.169740] Call Trace:
[ 3491.169740] [<ffffffff81314922>] trailing_symlink+0x62/0x260
[ 3491.169740] [<ffffffff8131729d>] path_openat+0x2ed/0x1560
[ 3491.169740] [<ffffffff81319549>] do_filp_open+0x79/0xd0
[ 3491.169740] [<ffffffff82b11d92>] ? _raw_spin_unlock+0x22/0x30
[ 3491.169740] [<ffffffff81328f08>] ? __alloc_fd+0xf8/0x200
[ 3491.169740] [<ffffffff81306eb0>] do_sys_open+0x110/0x1f0
[ 3491.169740] [<ffffffff81306fbf>] SyS_openat+0xf/0x20
[ 3491.169740] [<ffffffff82b12836>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 3491.169740] Code: ff ff ff 48 85 c0 48 89 c3 74 08 48 89 c7 e8 ef
dc ff ff 48 89 d8 5b 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 53
48 83 ec 08 <f6> 46 0c 02 48 8b 1f 75 6b 48 8b 7e 28 48 8b 47 50 a9 01
04 00
[ 3491.169740] RIP [<ffffffff81327429>] atime_needs_update+0x9/0xc0
[ 3491.169740] RSP <ffff88002d6ebcb0>
[ 3491.169740] CR2: 000000000000000c
[ 3491.195051] ---[ end trace c65af9be7536efaf ]---
[ 3491.195051] BUG: sleeping function called from invalid context at
include/linux/sched.h:2795
[ 3491.195051] in_atomic(): 1, irqs_disabled(): 1, pid: 17000, name:
syz-executor
[ 3491.195051] INFO: lockdep is turned off.
[ 3491.195051] irq event stamp: 622
[ 3491.195051] hardirqs last enabled at (621): [<ffffffff811d9926>]
vprintk_emit+0x2d6/0x5f0
[ 3491.195051] hardirqs last disabled at (622): [<ffffffff82b14d49>]
error_entry+0x69/0xc0
[ 3491.195051] softirqs last enabled at (618): [<ffffffff81178172>]
__do_softirq+0x222/0x4a0
[ 3491.195051] softirqs last disabled at (595): [<ffffffff81178767>]
irq_exit+0xa7/0xc0
[ 3491.195051] CPU: 1 PID: 17000 Comm: syz-executor Tainted: G D
W 4.5.0-rc5+ #72
[ 3491.195051] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[ 3491.195051] 0000000000000000 ffff88002d6eb9a8 ffffffff8194ad79
ffff88002c9196c0
[ 3491.195051] 0000000000004268 ffff88002d6eb9d0 ffffffff811a0659
ffffffff8329b7fb
[ 3491.195051] 0000000000000aeb 0000000000000000 ffff88002d6eb9f8
ffffffff811a0764
[ 3491.195051] Call Trace:
[ 3491.195051] [<ffffffff8194ad79>] dump_stack+0x99/0xd0
[ 3491.195051] [<ffffffff811a0659>] ___might_sleep+0x179/0x240
[ 3491.195051] [<ffffffff811a0764>] __might_sleep+0x44/0x80
[ 3491.195051] [<ffffffff811842cf>] exit_signals+0x1f/0x130
[ 3491.195051] [<ffffffff811756ef>] do_exit+0xbf/0xd10
[ 3491.195051] [<ffffffff811da524>] ? kmsg_dump+0x104/0x180
[ 3491.195051] [<ffffffff8108549f>] oops_end+0x9f/0xe0
[ 3491.195051] [<ffffffff810ce6e8>] no_context+0x108/0x390
[ 3491.195051] [<ffffffff810cea8d>] __bad_area_nosemaphore+0x11d/0x220
[ 3491.195051] [<ffffffff810ceb9e>] bad_area_nosemaphore+0xe/0x10
[ 3491.195051] [<ffffffff810cf2c4>] __do_page_fault+0x84/0x470
[ 3491.195051] [<ffffffff810cf764>] trace_do_page_fault+0x74/0x2c0
[ 3491.195051] [<ffffffff810c9ba4>] do_async_page_fault+0x14/0x90
[ 3491.195051] [<ffffffff82b14b78>] async_page_fault+0x28/0x30
[ 3491.195051] [<ffffffff81327429>] ? atime_needs_update+0x9/0xc0
[ 3491.195051] [<ffffffff81314922>] trailing_symlink+0x62/0x260
[ 3491.195051] [<ffffffff8131729d>] path_openat+0x2ed/0x1560
[ 3491.195051] [<ffffffff81319549>] do_filp_open+0x79/0xd0
[ 3491.195051] [<ffffffff82b11d92>] ? _raw_spin_unlock+0x22/0x30
[ 3491.195051] [<ffffffff81328f08>] ? __alloc_fd+0xf8/0x200
[ 3491.195051] [<ffffffff81306eb0>] do_sys_open+0x110/0x1f0
[ 3491.195051] [<ffffffff81306fbf>] SyS_openat+0xf/0x20
[ 3491.195051] [<ffffffff82b12836>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 3491.284831] note: syz-executor[17000] exited with preempt_count 1



The warning is this one:

static inline int should_follow_link(struct nameidata *nd, struct path *link,
int follow,
struct inode *inode, unsigned seq)
{
....

Al Viro

unread,
Feb 24, 2016, 10:15:20 AM2/24/16
to Dmitry Vyukov, Ian Kent, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin
On Wed, Feb 24, 2016 at 02:35:18PM +0100, Dmitry Vyukov wrote:

> The warning is this one:
>
> static inline int should_follow_link(struct nameidata *nd, struct path *link,
> int follow,
> struct inode *inode, unsigned seq)
> {
> ....
> WARN_ON(!inode); // now, _that_ should not happen.
> return pick_link(nd, link, inode, seq);
> }

Let's try it with less chatty should_follow_link() and better set of
tripwires on the way to it:

diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index c6d7d3d..86f81e3 100644
--- a/fs/autofs4/root.c
+++ b/fs/autofs4/root.c
@@ -323,6 +323,7 @@ static struct dentry *autofs4_mountpoint_changed(struct path *path)
struct dentry *new = d_lookup(parent, &dentry->d_name);
if (!new)
return NULL;
+ WARN_ON(d_is_negative(new));
ino = autofs4_dentry_ino(new);
ino->last_used = jiffies;
dput(path->dentry);
diff --git a/fs/namei.c b/fs/namei.c
index f624d13..071a4ba 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1209,6 +1209,7 @@ static int follow_managed(struct path *path, struct nameidata *nd)
/* Handle an automount point */
if (managed & DCACHE_NEED_AUTOMOUNT) {
ret = follow_automount(path, nd, &need_mntput);
+ WARN_ON(d_is_negative(path->dentry));
if (ret < 0)
break;
continue;
@@ -1582,8 +1583,10 @@ static int lookup_fast(struct nameidata *nd,
return -ENOENT;
path->mnt = mnt;
path->dentry = dentry;
- if (likely(__follow_mount_rcu(nd, path, inode, seqp)))
+ if (likely(__follow_mount_rcu(nd, path, inode, seqp))) {
+ WARN_ON(!*inode);
return 0;
+ }
unlazy:
if (unlazy_walk(nd, dentry, seq))
return -ECHILD;
@@ -1613,8 +1616,10 @@ unlazy:
path->mnt = mnt;
path->dentry = dentry;
err = follow_managed(path, nd);
- if (likely(!err))
+ if (likely(!err)) {
*inode = d_backing_inode(path->dentry);
+ WARN_ON(!*inode);
+ }
return err;

need_lookup:
@@ -1712,6 +1717,12 @@ static inline int should_follow_link(struct nameidata *nd, struct path *link,
return 0;
if (!follow)
return 0;
+ /* make sure that d_is_symlink above matches inode */
+ if (nd->flags & LOOKUP_RCU) {
+ if (read_seqcount_retry(&link->dentry->d_seq, seq))
+ return -ECHILD;
+ }
+ WARN_ON(!inode); // now, _that_ should not happen.
return pick_link(nd, link, inode, seq);
}

@@ -1743,11 +1754,11 @@ static int walk_component(struct nameidata *nd, int flags)
if (err < 0)
return err;

- inode = d_backing_inode(path.dentry);
seq = 0; /* we are already out of RCU mode */
err = -ENOENT;
if (d_is_negative(path.dentry))
goto out_path_put;
+ inode = d_backing_inode(path.dentry);
}

if (flags & WALK_PUT)
@@ -3106,8 +3117,10 @@ static int do_last(struct nameidata *nd,
nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
/* we _can_ be in RCU mode here */
error = lookup_fast(nd, &path, &inode, &seq);
- if (likely(!error))
+ if (likely(!error)) {
+ WARN_ON(!inode);
goto finish_lookup;
+ }

if (error < 0)
return error;
@@ -3192,12 +3205,13 @@ retry_lookup:
return error;

BUG_ON(nd->flags & LOOKUP_RCU);
- inode = d_backing_inode(path.dentry);
seq = 0; /* out of RCU mode, so the value doesn't matter */
if (unlikely(d_is_negative(path.dentry))) {
path_to_nameidata(&path, nd);
return -ENOENT;
}
+ inode = d_backing_inode(path.dentry);
+ WARN_ON(!inode);

Dmitry Vyukov

unread,
Feb 25, 2016, 3:29:41 AM2/25/16
to Al Viro, Ian Kent, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin
On Wed, Feb 24, 2016 at 4:15 PM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
> On Wed, Feb 24, 2016 at 02:35:18PM +0100, Dmitry Vyukov wrote:
>
>> The warning is this one:
>>
>> static inline int should_follow_link(struct nameidata *nd, struct path *link,
>> int follow,
>> struct inode *inode, unsigned seq)
>> {
>> ....
>> WARN_ON(!inode); // now, _that_ should not happen.
>> return pick_link(nd, link, inode, seq);
>> }
>
> Let's try it with less chatty should_follow_link() and better set of
> tripwires on the way to it:


Humm... I've left it running over night but no GPFs happened...
Usually they happened within two hours or so. I would think that your
patch fixes it and I did not actually apply it last time (or did not
rebuild kernel). But I saw the new warnings that the patch adds, so I
should have been rebuilt it...

What I saw is a dozen of pairs of warnings like the one below.
Is it possible the warning printing introduces enough delay to close
the inconsistency window?....


[46839.557154] ------------[ cut here ]------------
[46839.557663] WARNING: CPU: 0 PID: 10503 at fs/namei.c:1587
lookup_fast+0x2bf/0x420()
[46839.558239] Modules linked in:
[46839.558466] CPU: 0 PID: 10503 Comm: syz-executor Tainted: G
W 4.5.0-rc5+ #73
[46839.558980] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[46839.559174] 0000000000000000 ffff88002b6ebc48 ffffffff8194acd9
0000000000000000
[46839.559564] ffffffff83344ffc ffff88002b6ebc80 ffffffff81172291
ffff88002b6ebde0
[46839.559564] ffff8800314cc720 ffff88002b6ebd90 ffff88002b6ebd98
ffff88002b6ebd8c
[46839.560614] Call Trace:
[46839.560614] [<ffffffff8194acd9>] dump_stack+0x99/0xd0
[46839.561294] [<ffffffff81172291>] warn_slowpath_common+0x81/0xc0
[46839.561712] [<ffffffff81172385>] warn_slowpath_null+0x15/0x20
[46839.561712] [<ffffffff81314ddf>] lookup_fast+0x2bf/0x420
[46839.561712] [<ffffffff81315568>] ? link_path_walk+0x68/0x4e0
[46839.561712] [<ffffffff813172b5>] path_openat+0x375/0x1520
[46839.563137] [<ffffffff81319499>] do_filp_open+0x79/0xd0
[46839.563137] [<ffffffff82b11cf2>] ? _raw_spin_unlock+0x22/0x30
[46839.563137] [<ffffffff81328e58>] ? __alloc_fd+0xf8/0x200
[46839.564353] [<ffffffff81306eb0>] do_sys_open+0x110/0x1f0
[46839.564353] [<ffffffff81306fbf>] SyS_openat+0xf/0x20
[46839.564353] [<ffffffff82b12776>] entry_SYSCALL_64_fastpath+0x16/0x7a
[46839.565558] ---[ end trace 3c3bc0f927bf4e90 ]---
[46839.565858] ------------[ cut here ]------------
[46839.566203] WARNING: CPU: 0 PID: 10503 at fs/namei.c:3121
path_openat+0x12bc/0x1520()
[46839.566700] Modules linked in:
[46839.566914] CPU: 0 PID: 10503 Comm: syz-executor Tainted: G
W 4.5.0-rc5+ #73
[46839.567154] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[46839.567154] 0000000000000000 ffff88002b6ebcb8 ffffffff8194acd9
0000000000000000
[46839.567154] ffffffff83344ffc ffff88002b6ebcf0 ffffffff81172291
0000000000000005
[46839.567154] ffff88002b6ebd98 0000000000048000 ffff88002b6ebde0
ffff88002b6ebefc
[46839.567154] Call Trace:
[46839.567154] [<ffffffff8194acd9>] dump_stack+0x99/0xd0
[46839.567154] [<ffffffff81172291>] warn_slowpath_common+0x81/0xc0
[46839.567154] [<ffffffff81172385>] warn_slowpath_null+0x15/0x20
[46839.567154] [<ffffffff813181fc>] path_openat+0x12bc/0x1520
[46839.567154] [<ffffffff81319499>] do_filp_open+0x79/0xd0
[46839.567154] [<ffffffff82b11cf2>] ? _raw_spin_unlock+0x22/0x30
[46839.567154] [<ffffffff81328e58>] ? __alloc_fd+0xf8/0x200
[46839.567154] [<ffffffff81306eb0>] do_sys_open+0x110/0x1f0
[46839.567154] [<ffffffff81306fbf>] SyS_openat+0xf/0x20
[46839.567154] [<ffffffff82b12776>] entry_SYSCALL_64_fastpath+0x16/0x7a
[46839.573763] ---[ end trace 3c3bc0f927bf4e91 ]---

Al Viro

unread,
Feb 25, 2016, 11:39:33 AM2/25/16
to Dmitry Vyukov, Ian Kent, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Linus Torvalds
On Thu, Feb 25, 2016 at 09:29:21AM +0100, Dmitry Vyukov wrote:

> Humm... I've left it running over night but no GPFs happened...
> Usually they happened within two hours or so. I would think that your
> patch fixes it and I did not actually apply it last time (or did not
> rebuild kernel). But I saw the new warnings that the patch adds, so I
> should have been rebuilt it...
>
> What I saw is a dozen of pairs of warnings like the one below.
> Is it possible the warning printing introduces enough delay to close
> the inconsistency window?....

All that stops these warnings from triggering atime_... oopsen is that dentry
involved isn't a symlink one. lookup_fast() should *not* return 0 and set
*inode to NULL. Ever. Trigger the same in walk_component() and you'll
get NULL nd->inode, which will oops as soon as you get to may_lookup() for
the next component (or atime one if dentry turns out to be a symlink and
not a directory).

IOW, what happened is that you've got dozens of instances of the underlying
bug triggered, all on non-symlink dentries in the last components of pathname.

The codepath in question is this:
*inode = d_backing_inode(dentry);
negative = d_is_negative(dentry);
if (read_seqcount_retry(&dentry->d_seq, seq))
return -ECHILD;
at that point we'd better have negative and *inode refering to the state of
dentry at the same moment - seq had been fetched before both the inode and
dentry flags and has remained unchanged until the later point, i.e. through
all the interval containing both fetches.
....
if (negative)
return -ENOENT;
no *inode changes in between, so it ought to be non-NULL
path->mnt = mnt;
path->dentry = dentry;
if (likely(__follow_mount_rcu(nd, path, inode, seqp))) {
WARN_ON(!*inode);
return 0;
and you are triggering that WARN_ON. So either __follow_mount_rcu() has
returned true and zeroed *inode, or we have something very wrong with
->d_seq.

__follow_mount_rcu() reassigns *inode only in one place:
mounted = __lookup_mnt(path->mnt, path->dentry);
if (!mounted)
break;
path->mnt = &mounted->mnt;
path->dentry = mounted->mnt.mnt_root;
nd->flags |= LOOKUP_JUMPED;
*seqp = read_seqcount_begin(&path->dentry->d_seq);
/*
* Update the inode too. We don't need to re-check the
* dentry sequence number here after this d_inode read,
* because a mount-point is always pinned.
*/
*inode = path->dentry->d_inode;
Note that it had returned true, so we have read_seqretry(&mount_lock, nd->m_seq)
yielding false, i.e. mount_lock hadn't been touched through all of that.
Having ->mnt_root->d_inode go NULL on a live vfsmount is a Bad Thing(tm), for
obvious reasons. ->mnt_root should've remained pinned until cleanup_mnt(),
which means that it must've already gotten through mntput_no_expire()
lock_mount_hash/unlock_mount_hash *before* we'd picked nd->m_seq; otherwise
we'd see mount_lock mismatch. Now, all removals from vfsmount hash should
happen under mount_lock and prior to cleanup_mnt() on victim. So we should've
had this:
Somebody:
hlist_del_init(&mounted->mnt_hash);
...
write_sequnlock(&mount_lock);
Us:
rcu_read_lock();
nd->m_seq = read_seqbegin(&mount_lock);
...
hlist_for_each_entry_rcu(p, head, mnt_hash)
... run into 'mounted'
find mount_lock untouched.

AFAICS, write_sequnlock/read_seqbegin barriers should've sufficed to
prevent that...

Hrm... OK, seeing that you still seem to trigger those within an hour or
two (and *any* of remaining WARN_ON() are serious bugs - none of the
"mitigation had been triggered" remained, sorry for not making it clear),
let's try this. Again, any WARN_ON triggered means that we'd caught something,
whether it progresses into oops or not.

diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index c6d7d3d..86f81e3 100644
--- a/fs/autofs4/root.c
+++ b/fs/autofs4/root.c
@@ -323,6 +323,7 @@ static struct dentry *autofs4_mountpoint_changed(struct path *path)
struct dentry *new = d_lookup(parent, &dentry->d_name);
if (!new)
return NULL;
+ WARN_ON(d_is_negative(new));
ino = autofs4_dentry_ino(new);
ino->last_used = jiffies;
dput(path->dentry);
diff --git a/fs/namei.c b/fs/namei.c
index f624d13..daa6b25 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1209,6 +1209,7 @@ static int follow_managed(struct path *path, struct nameidata *nd)
/* Handle an automount point */
if (managed & DCACHE_NEED_AUTOMOUNT) {
ret = follow_automount(path, nd, &need_mntput);
+ WARN_ON(d_is_negative(path->dentry));
if (ret < 0)
break;
continue;
@@ -1260,6 +1261,7 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
{
for (;;) {
struct mount *mounted;
+ void *p;
/*
* Don't forget we might have a non-mountpoint managed dentry
* that wants to block transit.
@@ -1289,7 +1291,9 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
* dentry sequence number here after this d_inode read,
* because a mount-point is always pinned.
*/
- *inode = path->dentry->d_inode;
+ p = *inode = path->dentry->d_inode;
+ if (unlikely(!p))
+ WARN_ON(!read_seqretry(&mount_lock, nd->m_seq));
}
return !read_seqretry(&mount_lock, nd->m_seq) &&
!(path->dentry->d_flags & DCACHE_NEED_AUTOMOUNT);
@@ -1580,10 +1584,12 @@ static int lookup_fast(struct nameidata *nd,
*/
if (negative)
return -ENOENT;
+ WARN_ON(!*inode); // ->d_seq was fucked somehow
path->mnt = mnt;
path->dentry = dentry;
- if (likely(__follow_mount_rcu(nd, path, inode, seqp)))
+ if (likely(__follow_mount_rcu(nd, path, inode, seqp))) {
return 0;
+ }
unlazy:
if (unlazy_walk(nd, dentry, seq))
return -ECHILD;
@@ -1613,8 +1619,10 @@ unlazy:
path->mnt = mnt;
path->dentry = dentry;
err = follow_managed(path, nd);
- if (likely(!err))
+ if (likely(!err)) {
*inode = d_backing_inode(path->dentry);
+ WARN_ON(!*inode);
+ }
return err;

need_lookup:
@@ -1712,6 +1720,12 @@ static inline int should_follow_link(struct nameidata *nd, struct path *link,
return 0;
if (!follow)
return 0;
+ /* make sure that d_is_symlink above matches inode */
+ if (nd->flags & LOOKUP_RCU) {
+ if (read_seqcount_retry(&link->dentry->d_seq, seq))
+ return -ECHILD;
+ }
+ WARN_ON(!inode); // now, _that_ should not happen.
return pick_link(nd, link, inode, seq);
}

@@ -1743,11 +1757,11 @@ static int walk_component(struct nameidata *nd, int flags)
if (err < 0)
return err;

- inode = d_backing_inode(path.dentry);
seq = 0; /* we are already out of RCU mode */
err = -ENOENT;
if (d_is_negative(path.dentry))
goto out_path_put;
+ inode = d_backing_inode(path.dentry);
}

if (flags & WALK_PUT)
@@ -3106,8 +3120,10 @@ static int do_last(struct nameidata *nd,
nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
/* we _can_ be in RCU mode here */
error = lookup_fast(nd, &path, &inode, &seq);
- if (likely(!error))
+ if (likely(!error)) {
+ WARN_ON(!inode);
goto finish_lookup;
+ }

if (error < 0)
return error;
@@ -3192,12 +3208,13 @@ retry_lookup:
return error;

BUG_ON(nd->flags & LOOKUP_RCU);
- inode = d_backing_inode(path.dentry);
seq = 0; /* out of RCU mode, so the value doesn't matter */
if (unlikely(d_is_negative(path.dentry))) {
path_to_nameidata(&path, nd);
return -ENOENT;
}
+ inode = d_backing_inode(path.dentry);
+ WARN_ON(!inode);
finish_lookup:
if (nd->depth)
put_link(nd);
@@ -3206,11 +3223,6 @@ finish_lookup:
if (unlikely(error))
return error;

- if (unlikely(d_is_symlink(path.dentry)) && !(open_flag & O_PATH)) {
- path_to_nameidata(&path, nd);
- return -ELOOP;
- }
-
if ((nd->flags & LOOKUP_RCU) || nd->path.mnt != path.mnt) {
path_to_nameidata(&path, nd);
} else {
@@ -3229,6 +3241,10 @@ finish_open:
return error;
}
audit_inode(nd->name, nd->path.dentry, 0);
+ if (unlikely(d_is_symlink(nd->path.dentry)) && !(open_flag & O_PATH)) {
+ error = -ELOOP;
+ goto out;
+ }
error = -EISDIR;
if ((open_flag & O_CREAT) && d_is_dir(nd->path.dentry))
goto out;
@@ -3273,6 +3289,10 @@ opened:
goto exit_fput;
}
out:
+ if (unlikely(error > 0)) {
+ WARN_ON(1);
+ error = -EINVAL;
+ }
if (got_write)
mnt_drop_write(nd->path.mnt);
path_put(&save_parent);
diff --git a/fs/namespace.c b/fs/namespace.c
index 4fb1691..4128a5c 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1060,6 +1060,8 @@ static void cleanup_mnt(struct mount *mnt)
* so mnt_get_writers() below is safe.
*/
WARN_ON(mnt_get_writers(mnt));
+ WARN_ON(!mnt->mnt.mnt_root->d_inode); // some joker has managed to
+ // make mnt_root negative on us
if (unlikely(mnt->mnt_pins.first))
mnt_pin_kill(mnt);
fsnotify_vfsmount_delete(&mnt->mnt);

Al Viro

unread,
Feb 26, 2016, 4:21:44 PM2/26/16
to Dmitry Vyukov, Ian Kent, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Linus Torvalds
On Thu, Feb 25, 2016 at 04:39:27PM +0000, Al Viro wrote:
> Hrm... OK, seeing that you still seem to trigger those within an hour or
> two (and *any* of remaining WARN_ON() are serious bugs - none of the
> "mitigation had been triggered" remained, sorry for not making it clear),
> let's try this. Again, any WARN_ON triggered means that we'd caught something,
> whether it progresses into oops or not.

Any news on that one? I'm going to carve fixes for understood bugs out of
that one and put those into tonight push, but it would be nice to sort out
all remaining crap lurking in that area...

Another question: what about the very first trace you'd posted, with apparent
GPF at 00000050? Have you seen anything like that afterwards?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majo...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

Dmitry Vyukov

unread,
Feb 26, 2016, 4:25:43 PM2/26/16
to Al Viro, Ian Kent, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Linus Torvalds
On Fri, Feb 26, 2016 at 10:21 PM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
> On Thu, Feb 25, 2016 at 04:39:27PM +0000, Al Viro wrote:
>> Hrm... OK, seeing that you still seem to trigger those within an hour or
>> two (and *any* of remaining WARN_ON() are serious bugs - none of the
>> "mitigation had been triggered" remained, sorry for not making it clear),
>> let's try this. Again, any WARN_ON triggered means that we'd caught something,
>> whether it progresses into oops or not.
>
> Any news on that one? I'm going to carve fixes for understood bugs out of
> that one and put those into tonight push, but it would be nice to sort out
> all remaining crap lurking in that area...
>
> Another question: what about the very first trace you'd posted, with apparent
> GPF at 00000050? Have you seen anything like that afterwards?

No, I did not have time to retest.

GPF at 00000050 was not mine, it was Mickaël's.

I did not try to reproduce mine first. But most likely it is the same
as the one I reproduced lately (GPF at NULL in atime_needs_update).

Al Viro

unread,
Feb 26, 2016, 5:08:03 PM2/26/16
to Dmitry Vyukov, Ian Kent, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Linus Torvalds
On Fri, Feb 26, 2016 at 10:25:21PM +0100, Dmitry Vyukov wrote:
> On Fri, Feb 26, 2016 at 10:21 PM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
> > On Thu, Feb 25, 2016 at 04:39:27PM +0000, Al Viro wrote:
> >> Hrm... OK, seeing that you still seem to trigger those within an hour or
> >> two (and *any* of remaining WARN_ON() are serious bugs - none of the
> >> "mitigation had been triggered" remained, sorry for not making it clear),
> >> let's try this. Again, any WARN_ON triggered means that we'd caught something,
> >> whether it progresses into oops or not.
> >
> > Any news on that one? I'm going to carve fixes for understood bugs out of
> > that one and put those into tonight push, but it would be nice to sort out
> > all remaining crap lurking in that area...
> >
> > Another question: what about the very first trace you'd posted, with apparent
> > GPF at 00000050? Have you seen anything like that afterwards?
>
> No, I did not have time to retest.
>
> GPF at 00000050 was not mine, it was Mickaël's.

Ah, OK - his is basically a forced nd->stack[] underrun, with passing a
never-assigned nd->link_inode to atime_needs_update(), so we are just
passing a contents of uninitialized stack word there and while it ends
up possible to dereference, it's not an address of struct inode and the
first attempt to follow a pointer in what would've been a struct inode
at that address (accessing inode->i_sb->s_flags) did blow up with GPF at
offsetof(struct super_block, s_flags).

All right, so we basically have several understood ones with fixes plus
something unknown that leads to lookup_fast() returning 0 with NULL in
*inode in about an hour or two on your setup...

Al Viro

unread,
Feb 27, 2016, 5:27:56 PM2/27/16
to Dmitry Vyukov, Ian Kent, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Linus Torvalds
BTW, what kind of userland are you using? The thing is, shared-subtree
setups differ, and if the crap is anywhere near vfsmount handling, that
could have some impact... So far I hadn't been able to trigger any of
these WARN_ON(); setup here is debian/testing on 4-way KVM guest with 4Gb
memory given to it running on a 6-way host (Phenom II X6 1100T, 3.3GHz, 16Gb
RAM total); 4.2 with debian/stable userland on host. What's the setup on
your reproducer?

Dmitry Vyukov

unread,
Feb 28, 2016, 10:43:42 AM2/28/16
to Al Viro, Ian Kent, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Linus Torvalds
Restarted fuzzer with the latest patch on top of
0fcbf996d848d03573113d83f4e3fb3bcfa5ab5e.

> All that stops these warnings from triggering atime_... oopsen is that dentry
> involved isn't a symlink one.

What worries me is that I am running the same program in the same
setup. The program does operate on symlinks and previous it triggered
oopses. But now it does not. I've also rebased onto latest Linus tree,
maybe that made difference...

My userspace is a Debian Wheezy built using this script:
https://github.com/google/syzkaller/blob/master/tools/create-image.sh

I run it in qemu as:
$ qemu-system-x86_64 -hda wheezy.img -net
user,host=10.0.2.10,hostfwd=tcp::10022-:22 -net nic -nographic -kernel
arch/x86/boot/bzImage -append "console=ttyS0 root=/dev/sda debug
earlyprintk=serial slub_debug=UZ" -enable-kvm -pidfile vm_pid -m 2G
-numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3 -smp
sockets=2,cores=2,threads=1 -usb -usbdevice mouse -usbdevice tablet
-soundhw all

I also use a pretty beefy config (attached) which includes KASAN and
KCOV both of which introduce significant slowdown and can affect
thread interleavings.
.config

Dmitry Vyukov

unread,
Feb 28, 2016, 11:04:39 AM2/28/16
to Al Viro, Ian Kent, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Linus Torvalds
What was triggered so far is this. As far as I see it it roughly the
same as before.


[ 1422.292356] ------------[ cut here ]------------
[ 1422.292841] WARNING: CPU: 0 PID: 32603 at fs/namei.c:1587
lookup_fast+0x3fa/0x450()
[ 1422.293543] Modules linked in:
[ 1422.293868] CPU: 0 PID: 32603 Comm: syz-executor Not tainted 4.5.0-rc4+ #75
[ 1422.294426] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[ 1422.294482] 0000000000000000 ffff8800148d3c48 ffffffff81931fc9
0000000000000000
[ 1422.294482] ffffffff83314939 ffff8800148d3c80 ffffffff8116eee1
ffff8800148d3de0
[ 1422.294482] ffff8800148d3d90 ffff8800148d3d98 ffff8800148d3d8c
0000000000000001
[ 1422.294482] Call Trace:
[ 1422.294482] [<ffffffff81931fc9>] dump_stack+0x99/0xd0
[ 1422.294482] [<ffffffff8116eee1>] warn_slowpath_common+0x81/0xc0
[ 1422.294482] [<ffffffff8116efd5>] warn_slowpath_null+0x15/0x20
[ 1422.294482] [<ffffffff8130e89a>] lookup_fast+0x3fa/0x450
[ 1422.294482] [<ffffffff8130f388>] ? link_path_walk+0x68/0x4e0
[ 1422.294482] [<ffffffff8130fe66>] ? path_init+0x666/0x810
[ 1422.294482] [<ffffffff81310775>] path_openat+0x375/0x1520
[ 1422.294482] [<ffffffff811c780d>] ? trace_hardirqs_on+0xd/0x10
[ 1422.294482] [<ffffffff81313129>] do_filp_open+0x79/0xd0
[ 1422.294482] [<ffffffff82ae3022>] ? _raw_spin_unlock+0x22/0x30
[ 1422.294482] [<ffffffff81322af8>] ? __alloc_fd+0xf8/0x200
[ 1422.294482] [<ffffffff81300c10>] do_sys_open+0x110/0x1f0
[ 1422.294482] [<ffffffff81300d1f>] SyS_openat+0xf/0x20
[ 1422.294482] [<ffffffff82ae3ab6>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 1422.304062] ---[ end trace 658f7fb8fc01ebf0 ]---
[ 1422.304425] ------------[ cut here ]------------
[ 1422.304842] WARNING: CPU: 0 PID: 32603 at fs/namei.c:3124
path_openat+0x12bc/0x1520()
[ 1422.305551] Modules linked in:
[ 1422.305803] CPU: 0 PID: 32603 Comm: syz-executor Tainted: G
W 4.5.0-rc4+ #75
[ 1422.306476] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Bochs 01/01/2011
[ 1422.306476] 0000000000000000 ffff8800148d3cb8 ffffffff81931fc9
0000000000000000
[ 1422.306476] ffffffff83314939 ffff8800148d3cf0 ffffffff8116eee1
0000000000000005
[ 1422.306476] ffff8800148d3d98 0000000000048000 ffff8800148d3de0
ffff8800148d3efc
[ 1422.306476] Call Trace:
[ 1422.306476] [<ffffffff81931fc9>] dump_stack+0x99/0xd0
[ 1422.306476] [<ffffffff8116eee1>] warn_slowpath_common+0x81/0xc0
[ 1422.306476] [<ffffffff8116efd5>] warn_slowpath_null+0x15/0x20
[ 1422.306476] [<ffffffff813116bc>] path_openat+0x12bc/0x1520
[ 1422.306476] [<ffffffff81313129>] do_filp_open+0x79/0xd0
[ 1422.306476] [<ffffffff82ae3022>] ? _raw_spin_unlock+0x22/0x30
[ 1422.306476] [<ffffffff81322af8>] ? __alloc_fd+0xf8/0x200
[ 1422.306476] [<ffffffff81300c10>] do_sys_open+0x110/0x1f0
[ 1422.306476] [<ffffffff81300d1f>] SyS_openat+0xf/0x20
[ 1422.306476] [<ffffffff82ae3ab6>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 1422.314201] ---[ end trace 658f7fb8fc01ebf1 ]---
INIT: Id "V0" respawning too fast: disabled for 5 minutes

Al Viro

unread,
Feb 28, 2016, 12:01:44 PM2/28/16
to Dmitry Vyukov, Ian Kent, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Linus Torvalds, David Howells
[dhowells Cc'd]

On Sun, Feb 28, 2016 at 05:04:19PM +0100, Dmitry Vyukov wrote:

>
> [ 1422.292356] ------------[ cut here ]------------
> [ 1422.292841] WARNING: CPU: 0 PID: 32603 at fs/namei.c:1587
> lookup_fast+0x3fa/0x450()

Huh? So you have
dentry = __d_lookup_rcu(parent, &nd->last, &seq);
returning non-NULL dentry, then
*inode = d_backing_inode(dentry);
negative = d_is_negative(dentry);
if (read_seqcount_retry(&dentry->d_seq, seq))
return -ECHILD;
followed by by *inode == NULL and negative == true?

Nuts... OK, that removes vfsmounts from consideration, but... How the
fuck is that possible? We have
smp_rmb();
seq = &dentry->d_seq->sequence & ~1;
see that ->d_name and ->d_parent match what we are looking for,
then
*inode = dentry->d_inode;
type = READ_ONCE(dentry->d_flags);
smp_rmb();
negative = (type & DCACHE_ENTRY_TYPE) == DCACHE_MISS_TYPE;
smp_rmb();
if (dentry->d_seq->sequence != seq)
sod off
and observe *inode == NULL && !negative

Erm... What's to order ->d_inode and ->d_flags fetches there? David?
Looks like the barrier in d_is_negative() is on the wrong side of fetch.
Confused...

Al Viro

unread,
Feb 28, 2016, 3:01:06 PM2/28/16
to Dmitry Vyukov, Ian Kent, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Linus Torvalds, David Howells
On Sun, Feb 28, 2016 at 05:01:34PM +0000, Al Viro wrote:

> Erm... What's to order ->d_inode and ->d_flags fetches there? David?
> Looks like the barrier in d_is_negative() is on the wrong side of fetch.
> Confused...

OK, as per David's suggestion, let's flip them around, bringing the
barrier in d_is_negative() between them. Dmitry, could you try this on
top of mainline? Again, it's until the first warning.

diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index c6d7d3d..86f81e3 100644
--- a/fs/autofs4/root.c
+++ b/fs/autofs4/root.c
@@ -323,6 +323,7 @@ static struct dentry *autofs4_mountpoint_changed(struct path *path)
struct dentry *new = d_lookup(parent, &dentry->d_name);
if (!new)
return NULL;
+ WARN_ON(d_is_negative(new));
ino = autofs4_dentry_ino(new);
ino->last_used = jiffies;
dput(path->dentry);
diff --git a/fs/namei.c b/fs/namei.c
index 9c590e0..630d222 100644
@@ -1550,8 +1554,8 @@ static int lookup_fast(struct nameidata *nd,
* This sequence count validates that the inode matches
* the dentry name information from lookup.
*/
- *inode = d_backing_inode(dentry);
negative = d_is_negative(dentry);
+ *inode = d_backing_inode(dentry);
if (read_seqcount_retry(&dentry->d_seq, seq))
return -ECHILD;

@@ -1580,6 +1584,7 @@ static int lookup_fast(struct nameidata *nd,
*/
if (negative)
return -ENOENT;
+ WARN_ON(!*inode); // ->d_seq was fucked somehow
path->mnt = mnt;
path->dentry = dentry;
if (likely(__follow_mount_rcu(nd, path, inode, seqp)))
@@ -1613,8 +1618,10 @@ unlazy:
path->mnt = mnt;
path->dentry = dentry;
err = follow_managed(path, nd);
- if (likely(!err))
+ if (likely(!err)) {
*inode = d_backing_inode(path->dentry);
+ WARN_ON(!*inode);
+ }
return err;

need_lookup:
@@ -1717,6 +1724,7 @@ static inline int should_follow_link(struct nameidata *nd, struct path *link,
if (read_seqcount_retry(&link->dentry->d_seq, seq))
return -ECHILD;
}
+ WARN_ON(!inode); // now, _that_ should not happen.
return pick_link(nd, link, inode, seq);
}

@@ -3111,8 +3119,10 @@ static int do_last(struct nameidata *nd,
nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
/* we _can_ be in RCU mode here */
error = lookup_fast(nd, &path, &inode, &seq);
- if (likely(!error))
+ if (likely(!error)) {
+ WARN_ON(!inode);
goto finish_lookup;
+ }

if (error < 0)
return error;
@@ -3203,6 +3213,7 @@ retry_lookup:
return -ENOENT;
}
inode = d_backing_inode(path.dentry);
+ WARN_ON(!inode);
finish_lookup:
if (nd->depth)
put_link(nd);

Dmitry Vyukov

unread,
Feb 29, 2016, 4:38:21 AM2/29/16
to Al Viro, Ian Kent, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Linus Torvalds, David Howells
On Sun, Feb 28, 2016 at 9:01 PM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
> On Sun, Feb 28, 2016 at 05:01:34PM +0000, Al Viro wrote:
>
>> Erm... What's to order ->d_inode and ->d_flags fetches there? David?
>> Looks like the barrier in d_is_negative() is on the wrong side of fetch.
>> Confused...
>
> OK, as per David's suggestion, let's flip them around, bringing the
> barrier in d_is_negative() between them. Dmitry, could you try this on
> top of mainline? Again, it's until the first warning.


Good news, I was able to trigger these warnings on a plain C program:
https://gist.githubusercontent.com/dvyukov/1a81426b8a5dd3620d6f/raw/fe6d03cfb0d219ad3d979f8bd6c016a5a1b93212/gistfile1.txt
Unfortunately, the failure rate is significantly lower than with
syzkaller. Syzkaller triggered it 8 times in 18 hours on a single VM;
while the C program triggered it once on 2 VMs. Al, maybe you can
modify the program to increase failure rate? I would expect that we
need to clash 2 (or maybe 3) syscalls with right timing to trigger it.
You must have a better idea as to what are these syscalls.
P.S. this is still with the previous patch, not the latest one.

Dmitry Vyukov

unread,
Feb 29, 2016, 7:34:33 AM2/29/16
to Al Viro, Ian Kent, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Linus Torvalds, David Howells
On Mon, Feb 29, 2016 at 10:38 AM, Dmitry Vyukov <dvy...@google.com> wrote:
> On Sun, Feb 28, 2016 at 9:01 PM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
>> On Sun, Feb 28, 2016 at 05:01:34PM +0000, Al Viro wrote:
>>
>>> Erm... What's to order ->d_inode and ->d_flags fetches there? David?
>>> Looks like the barrier in d_is_negative() is on the wrong side of fetch.
>>> Confused...
>>
>> OK, as per David's suggestion, let's flip them around, bringing the
>> barrier in d_is_negative() between them. Dmitry, could you try this on
>> top of mainline? Again, it's until the first warning.


I am testing the new patch for several hours in several VMs now, I
would expect a WARNING to already fire. But no warnings fired so far.
I will keep it running. But meanwhile, do you an explanation of how:

- *inode = d_backing_inode(dentry);
negative = d_is_negative(dentry);
+ *inode = d_backing_inode(dentry);

can fix the bug? Explanation other than a missing rmw, because I am
running on x86 so I would not expect rmb to be the root cause (though,
it still may be necessary for other arches and to prevent possible
miscompilations). If you do have it and assuming that I will not see
warnings till tomorrow, then hopefully we can consider it as fixed!

It's not that I really understand what happens here, but looking at
the diff: is it the case that negative and inode can change under our
feet? If so, we still probably can get an inconsistent picture (i.e.
negative dentry but not NULL inode), can it be an issue? Is
non-negative->negative->non-negative->negative transition possible? If
so, we still probably can get the same crash regardless of order of
negative/inode loads.

Al Viro

unread,
Feb 29, 2016, 8:09:30 AM2/29/16
to Dmitry Vyukov, Ian Kent, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Linus Torvalds, David Howells
On Sun, Feb 28, 2016 at 08:01:01PM +0000, Al Viro wrote:
> On Sun, Feb 28, 2016 at 05:01:34PM +0000, Al Viro wrote:
>
> > Erm... What's to order ->d_inode and ->d_flags fetches there? David?
> > Looks like the barrier in d_is_negative() is on the wrong side of fetch.
> > Confused...
>
> OK, as per David's suggestion, let's flip them around, bringing the
> barrier in d_is_negative() between them. Dmitry, could you try this on
> top of mainline? Again, it's until the first warning.

Hmm... Reordering is definitely wrong, but what I really wonder is if
dentry_rcuwalk_invalidate() is right outside of __d_drop(). IOW, is
it right in __d_instantiate() and dentry_unlink_inode()? The code
dealing with ->d_flags in RCU mode is more interested in coherency between
->d_flags and ->d_inode and it looks like we'd need *two* increments -
even-to-odd before updating both and odd-to-even after both are in sync.
The more I look at the situation with d_is_...() wrt barriers and ->d_seq,
the less I understand it; outside of RCU mode we don't really need the
barriers for that stuff and in RCU mode ->d_flags handling had been
a serious headache all along...

I'm tempted to do as below; the amount of smp_wmb() remains the same and
so's the amount of stores (splitting that += 2 in two doesn't affect that -
we dirty the same cacheline before and after anyway). OTOH, that would
mean that ->d_seq match guarantees ->d_flags and ->d_inode being in sync.
And I suspect that we could drop _read_ barriers in d_is_...() after that;
in non-RCU mode we don't give a damn anyway and in RCU one ->d_seq check
would provide one; it doesn't really matter on x86, but smp_rmb() may be
costly. Splitting ->d_seq increments *does* matter on x86 wrt correctness;
in-between state becomes guaranteed ->d_seq mismatch and that just might
be it. Dmitry, could you add this on top of the previous patch?

David, Linus, do you see any problems with that? To me it looks saner
that way and as cheap as the current code, but I might be missing something
here...

diff --git a/fs/dcache.c b/fs/dcache.c
index 92d5140..2c08cce 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -279,7 +279,6 @@ static inline void __d_set_inode_and_type(struct dentry *dentry,
unsigned flags;

dentry->d_inode = inode;
- smp_wmb();
flags = READ_ONCE(dentry->d_flags);
flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
flags |= type_flags;
@@ -300,7 +299,6 @@ static inline void __d_clear_type_and_inode(struct dentry *dentry)

flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
WRITE_ONCE(dentry->d_flags, flags);
- smp_wmb();
dentry->d_inode = NULL;
}

@@ -370,9 +368,11 @@ static void dentry_unlink_inode(struct dentry * dentry)
__releases(dentry->d_inode->i_lock)
{
struct inode *inode = dentry->d_inode;
+
+ raw_write_seqcount_begin(&dentry->d_seq);
__d_clear_type_and_inode(dentry);
hlist_del_init(&dentry->d_u.d_alias);
- dentry_rcuwalk_invalidate(dentry);
+ raw_write_seqcount_end(&dentry->d_seq);
spin_unlock(&dentry->d_lock);
spin_unlock(&inode->i_lock);
if (!inode->i_nlink)
@@ -1758,8 +1758,9 @@ static void __d_instantiate(struct dentry *dentry, struct inode *inode)
spin_lock(&dentry->d_lock);
if (inode)
hlist_add_head(&dentry->d_u.d_alias, &inode->i_dentry);
+ raw_write_seqcount_begin(&dentry->d_seq);
__d_set_inode_and_type(dentry, inode, add_flags);
- dentry_rcuwalk_invalidate(dentry);
+ raw_write_seqcount_end(&dentry->d_seq);
spin_unlock(&dentry->d_lock);
fsnotify_d_instantiate(dentry, inode);
}

David Howells

unread,
Feb 29, 2016, 8:44:01 AM2/29/16
to Al Viro, dhow...@redhat.com, Dmitry Vyukov, Ian Kent, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Linus Torvalds
Al Viro <vi...@ZenIV.linux.org.uk> wrote:

> David, Linus, do you see any problems with that? To me it looks saner
> that way and as cheap as the current code, but I might be missing something
> here...

You're effectively converting to this:

read d_seq.sequence
smp_rmb()
read d_inode, d_flags
smp_rmb()
check d_seq.sequence

in the read path and this:

write d_seq.sequence
smp_wmb()
write d_inode, d_flags
smp_wmb()
write d_seq.sequence

This should work - especially if we're wangling these sequence points anyway,
and so have to pay the barrier penalties whatever. In fact, you actually take
a barrier out, I think.

I have had a problem with getting the ordering of d_inode and d_flags right
because of __d_clear_type_and_inode() where we're required to unset a dentry
so that it can be repurposed[*] as a negative dentry rather than replacing it.

This is something we for performance sake - and it's something we can only do
if the dentry isn't referenced, whereas if we properly followed the RCU model,
we would have to wait a grace period after delisting an unlinked dentry before
we could repurpose it - but that significantly slows down rename, unlink and
rmdir.

[*] Al and I disagree on whether this is a reuse or merely a change of state.

David

Dmitry Vyukov

unread,
Feb 29, 2016, 10:55:15 AM2/29/16
to Al Viro, Ian Kent, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Linus Torvalds, David Howells
Regardless of whether reordering is wrong or not, do we see how it can
fix the WARNINGs/oopses? Because it does seem to. I've tried to revert
just this part:

- *inode = d_backing_inode(dentry);
negative = d_is_negative(dentry);
+ *inode = d_backing_inode(dentry);

And got:

[ 976.609688] WARNING: CPU: 0 PID: 12126 at fs/namei.c:1587
lookup_fast+0x3fa/0x450()
[ 976.626768] WARNING: CPU: 0 PID: 12126 at fs/namei.c:3123
path_openat+0x12bc/0x1520()

in 15 minutes.

In particular, applying this on top the previous patch will be
inconclusive, because I already don't see the warnings.

Al Viro

unread,
Feb 29, 2016, 11:11:51 AM2/29/16
to Dmitry Vyukov, Ian Kent, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Linus Torvalds, David Howells
On Mon, Feb 29, 2016 at 01:34:13PM +0100, Dmitry Vyukov wrote:

> It's not that I really understand what happens here, but looking at
> the diff: is it the case that negative and inode can change under our
> feet? If so, we still probably can get an inconsistent picture (i.e.
> negative dentry but not NULL inode), can it be an issue? Is
> non-negative->negative->non-negative->negative transition possible? If
> so, we still probably can get the same crash regardless of order of
> negative/inode loads.

Yes, we can - relying on the ordering is brittle and wrong. See other
posting for possible solution; at least that one has much more understandable
rules:
* ->d_seq is bumped before and after modifications of ->d_inode and
->d_flags, which both provides the barriers and (which is what matters for
x86) guarantees that ->d_seq match on recheck (which we do anyway) means
that ->d_inode and ->d_flags match each other.
* RCU users of that part of ->d_flags should be verified by ->d_seq
check (we already are doing that - should_follow_link() didn't and that was
one of the bugs that got fixed).
* non-RCU users either have the parent locked (which stabilizes
everything) or have dentry pinned and positive (ditto). Checking that
dentry is negative (either by looking at inode or flags) does not guarantee
that it will stay such unless the parent is locked anyway. IOW, the
games with barriers and order of assignments between ->d_inode and ->d_flags
do not actually buy us anything useful.
* in case of __dentry_kill() we do *NOT* surround the stores to
->d_inode/->d_flags with ->d_seq bumps, but that's safe since by that point
we had already done __d_drop(), so RCU reader either doesn't find the
dentry in the first place, or gets ->d_seq bumped (by 2) between the
moment it's been fetched by __d_lookup_rcu() finding the sucker and
the moment when ->d_inode/->d_flags get changed. If RCU reader gets in
before that, it sees consistent ->d_inode/->d_flags, as they used to be.
It will eventually fail to grab a reference to that dentry, but that's not
our problem. If it gets in late enough to see ->d_inode and/or ->d_flags
changed, it will fail ->d_seq check and ignore the values it has fetched.
Again, no need for a barrier between ->d_inode and ->d_flags stores in
that case.

As the matter of fact, I'm somewhat tempted to make
static void dentry_unlink_inode(struct dentry * dentry)
__releases(dentry->d_lock)
__releases(dentry->d_inode->i_lock)
{
struct inode *inode = dentry->d_inode;
bool hashed = !d_unhashed(dentry);

if (hashed)
raw_write_seqcount_begin(&dentry->d_seq);
__d_clear_type_and_inode(dentry);
hlist_del_init(&dentry->d_u.d_alias);
if (hashed)
raw_write_seqcount_end(&dentry->d_seq);
spin_unlock(&dentry->d_lock);
spin_unlock(&inode->i_lock);
if (!inode->i_nlink)
fsnotify_inoderemove(inode);
if (dentry->d_op && dentry->d_op->d_iput)
dentry->d_op->d_iput(dentry, inode);
else
iput(inode);
}

and replace dentry_iput() in its only caller with

if (dentry->d_inode)
dentry_unlink_inode(dentry); /* will drop ->d_lock */
else
spin_unlock(&dentry->d_lock);

That would get rid of annoying code duplication, but I would like to see
profiles - the cost of branches might very well get unpleasant. Not sure,
and that part definitely isn't a -stable fodder.

Al Viro

unread,
Feb 29, 2016, 11:19:58 AM2/29/16
to Dmitry Vyukov, Ian Kent, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Linus Torvalds, David Howells
On Mon, Feb 29, 2016 at 04:54:54PM +0100, Dmitry Vyukov wrote:

> Regardless of whether reordering is wrong or not, do we see how it can
> fix the WARNINGs/oopses? Because it does seem to. I've tried to revert
> just this part:
>
> - *inode = d_backing_inode(dentry);
> negative = d_is_negative(dentry);
> + *inode = d_backing_inode(dentry);
>
> And got:
>
> [ 976.609688] WARNING: CPU: 0 PID: 12126 at fs/namei.c:1587
> lookup_fast+0x3fa/0x450()
> [ 976.626768] WARNING: CPU: 0 PID: 12126 at fs/namei.c:3123
> path_openat+0x12bc/0x1520()
>
> in 15 minutes.

dentry going from negative to positive lookup_fast()
fetch NULL ->d_inode
store non-NULL ->d_inode
store new ->d_flags
fetch new ->d_flags
check ->d_seq
bump ->d_seq by 2

Change the order of fetches and you'll get rid of that scenario.

> In particular, applying this on top the previous patch will be
> inconclusive, because I already don't see the warnings.

Apply it with that reordering reversed, please.

Linus Torvalds

unread,
Feb 29, 2016, 11:45:38 AM2/29/16
to Al Viro, Dmitry Vyukov, Ian Kent, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, David Howells
On Mon, Feb 29, 2016 at 5:09 AM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
>
> The more I look at the situation with d_is_...() wrt barriers and ->d_seq,
> the less I understand it; outside of RCU mode we don't really need the
> barriers for that stuff and in RCU mode ->d_flags handling had been
> a serious headache all along...

Yeah, one of my least favorite "recent" vfs improvements.

> I'm tempted to do as below .. [ changing it to be unde the seqlock ]
>
> David, Linus, do you see any problems with that? To me it looks saner
> that way and as cheap as the current code, but I might be missing something
> here...

I'd absolutely love to see this. The memory ordering for the flags
updates and reading was always really confusing, and I hated how it
was hidden inside the random access functions. And apparently it
wasn't just confusing, it was buggy too.

But I'd love it _more_ if this also means that we can get rid of the
rmb's, which your patch didn't. Can we? Or does the ordering still
remain for some other issue?

Linus

Al Viro

unread,
Feb 29, 2016, 11:50:36 AM2/29/16
to Linus Torvalds, Dmitry Vyukov, Ian Kent, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, David Howells
On Mon, Feb 29, 2016 at 08:45:37AM -0800, Linus Torvalds wrote:

> > David, Linus, do you see any problems with that? To me it looks saner
> > that way and as cheap as the current code, but I might be missing something
> > here...
>
> I'd absolutely love to see this. The memory ordering for the flags
> updates and reading was always really confusing, and I hated how it
> was hidden inside the random access functions. And apparently it
> wasn't just confusing, it was buggy too.
>
> But I'd love it _more_ if this also means that we can get rid of the
> rmb's, which your patch didn't. Can we? Or does the ordering still
> remain for some other issue?

In __d_entry_type(), you mean? Should be, along with READ_ONCE() there.
AFAICS, ordering shouldn't be an issue anymore...

Al Viro

unread,
Feb 29, 2016, 12:20:58 PM2/29/16
to Linus Torvalds, Dmitry Vyukov, Ian Kent, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, David Howells
On Mon, Feb 29, 2016 at 04:50:31PM +0000, Al Viro wrote:

> In __d_entry_type(), you mean? Should be, along with READ_ONCE() there.
> AFAICS, ordering shouldn't be an issue anymore...

FWIW, I've pushed the candidate fix (including the above) into
vfs.git#for-linus; I'd prefer to wait for confirmation that dcache.c
part is sufficient to fix the problem (_without_ reordering in lookup_fast())
before sending a pull request, though.

Linus Torvalds

unread,
Feb 29, 2016, 12:24:05 PM2/29/16
to Al Viro, Dmitry Vyukov, Ian Kent, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, David Howells
On Mon, Feb 29, 2016 at 8:50 AM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
>
> In __d_entry_type(), you mean? Should be, along with READ_ONCE() there.
> AFAICS, ordering shouldn't be an issue anymore...

That's the one.

It results in those barriers in very subtle places, and it was always
unclear whether any of the users of those "d_is_*()" helper functions
really understood the subtle memory ordering involved.

The smp_rmb() is also potentially quite expensive on some architectures.

So getting rid of those hidden memory orderings would be a goodness
quite apart from fixing the bug Dmitry found.

Linus

Dmitry Vyukov

unread,
Feb 29, 2016, 1:20:12 PM2/29/16
to Al Viro, Ian Kent, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Linus Torvalds, David Howells
OK, just wanted to make sure that we keep track of the situation.
Restarted testing with combined patch. Here is it just in case:
https://gist.githubusercontent.com/dvyukov/67fe363d5ce2e2b06c71/raw/4d1b6c23f8dff7e0f8e2e3cab7e50208fddb0570/gistfile1.txt

Dmitry Vyukov

unread,
Mar 1, 2016, 3:59:48 AM3/1/16
to Al Viro, Ian Kent, Mickaël Salaün, linux-...@vger.kernel.org, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Sasha Levin, Linus Torvalds, David Howells
No warnings/crashes in 15 hours on 3 VMs!
Reply all
Reply to author
Forward
0 new messages