[PATCH] vfs: fix race in rcu lookup of pruned dentry

Hugh Dickins

unread,

Jul 17, 2011, 5:10:01 PM7/17/11

to

Since the RCU dcache lookup came in, on five occasions a "cp -a" in my dual
kbuild swapping tests has failed because a source file vanished momentarily
e.g. cp: cannot open `/usr/src/linux/include/linux/cramfs_fs.h' for reading:
No such file or directory

That -ENOENT in walk_component: isn't it assuming we found a negative
dentry, before reaching the read_seqcount_retry which complete_walk
(or nameidata_drop_rcu_last before 3.0) would use to confirm a successful
lookup? And can't memory pressure prune a dentry, coming to dentry_kill
which __d_drops to unhash before dentry_iput resets d_inode to NULL, but
the dentry_rcuwalk_barrier between those is ineffective if the other end
ignores the seqcount?

Move !inode -ENOENT checks from walk_component down into do_lookup,
adding read_seqcount_retry confirmation in the RCU case. And use
path_put_conditional instead of path_to_nameidata (a reverse way of
doing the same) in the non-RCU case, to match the error case just above.

Signed-off-by: Hugh Dickins <hu...@google.com>
Cc: sta...@kernel.org
---
I think I now have a faster script (four slightly staggered "cp -a"s of
kernel source, run in limited memory) to reproduce this in hours rather
than weeks; but I'm still calibrating that, cannot yet say whether this
patch fixes it. Sending patch on ahead in case it's obvious to you.

fs/namei.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)

--- 3.0-git/fs/namei.c 2011-07-15 19:40:03.269867213 -0700
+++ linux/fs/namei.c 2011-07-17 10:08:17.586475293 -0700
@@ -1173,7 +1173,10 @@ static int do_lookup(struct nameidata *n
goto unlazy;
if (unlikely(path->dentry->d_flags & DCACHE_NEED_AUTOMOUNT))
goto unlazy;
- return 0;
+ if (*inode)
+ return 0;
+ if (!read_seqcount_retry(&dentry->d_seq, nd->seq))
+ return -ENOENT;
unlazy:
if (unlazy_walk(nd, dentry))
return -ECHILD;
@@ -1223,6 +1226,10 @@ retry:
return err;
}
*inode = path->dentry->d_inode;
+ if (!*inode) {
+ path_put_conditional(path, nd);
+ return -ENOENT;
+ }
return 0;
}

@@ -1276,15 +1283,10 @@ static inline int walk_component(struct
if (unlikely(type != LAST_NORM))
return handle_dots(nd, type);
err = do_lookup(nd, name, path, &inode);
- if (unlikely(err)) {
+ if (err) {
terminate_walk(nd);
return err;
}
- if (!inode) {
- path_to_nameidata(path, nd);
- terminate_walk(nd);
- return -ENOENT;
- }
if (unlikely(inode->i_op->follow_link) && follow) {
if (nd->flags & LOOKUP_RCU) {
if (unlikely(unlazy_walk(nd, path->dentry))) {
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Linus Torvalds

unread,

Jul 17, 2011, 6:10:01 PM7/17/11

to

On Sun, Jul 17, 2011 at 2:03 PM, Hugh Dickins <hu...@google.com> wrote:
>
> That -ENOENT in walk_component: isn't it assuming we found a negative
> dentry, before reaching the read_seqcount_retry which complete_walk
> (or nameidata_drop_rcu_last before 3.0) would use to confirm a successful
> lookup?

Hmm. I think you're right. The ENOENT will basically short-circuit the
full proper checks.

> And can't memory pressure prune a dentry, coming to dentry_kill
> which __d_drops to unhash before dentry_iput resets d_inode to NULL, but
> the dentry_rcuwalk_barrier between those is ineffective if the other end
> ignores the seqcount?

Yes. However, looking at it, I'm not very happy with your patch. It
doesn't really make sense to me to special-case the NULL inode and
only do a seq_retry for that case.

I kind of see why you do it for that particular bug, but at the same
time, it just makes me go "Eww". If that inode isn't NULL yet, you
then return the dentry that can get a NULL d_inode later. So the only
special thing there is that we happen to check for a NULL inode there.
What protects *later* checks for a NULL d_inode?

So my gut feel is that we should instead

- either remove the -ENOENT return at that point entirely, and move
it to after we have re-verified the dentry lookup for other reasons.
That looks pretty involved, though, and those paths do end up
accessing inode data structures etc, so it looks less than trivial.

OR

- simply just not clear d_inode at all in d_kill(), so that when we
prune a dentry due to memory pressure, it doesn't actually change the
state of the dentry.

and I think the second solution is the right one. It's kind of odd:
we'll have called down to the iput() routine, and the inode will be
"gone", but that is already true for the *normal* race of actually
deleting a file too, and we have that whole "inodes are RCU-released",
so the inode allocation will still exist.

So my gut feel is that we should instead just do this:

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -187,7 +187,6 @@ static void dentry_iput(struct dentry * dentry)
{
struct inode *inode = dentry->d_inode;
if (inode) {
- dentry->d_inode = NULL;
list_del_init(&dentry->d_alias);
spin_unlock(&dentry->d_lock);
spin_unlock(&inode->i_lock);

and see what the fall-out from that would be. Nobody should then *use*
the stale inode, because __d_drop has done that
dentry_rcuwalk_barrier(). So we avoid the NULL inode special case
entirely.

Comments?

The above (whitespace-damaged) patch may look trivial, but it is
*entirely* untested, and maybe my gut feel that the above is the right
way to solve the problem is just wrong.

Al, any reactions? Hugh, does the above patch work for your
stress-test case? Or, indeed, at all?

Linus

Linus Torvalds

unread,

Jul 17, 2011, 7:10:02 PM7/17/11

to

On Sun, Jul 17, 2011 at 3:00 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
>
> The above (whitespace-damaged) patch may look trivial, but it is
> *entirely* untested, and maybe my gut feel that the above is the right
> way to solve the problem is just wrong.
>
> Al, any reactions? Hugh, does the above patch work for your
> stress-test case? Or, indeed, at all?

Hmm, it worked for my small test, including under memory pressure that
was shrinking dentries. But my test is not the kind of stress-test
that you clearly have, so that only means that it has not *really*
obvious breakages.

Al Viro

unread,

Jul 17, 2011, 7:20:01 PM7/17/11

to

On Sun, Jul 17, 2011 at 03:00:06PM -0700, Linus Torvalds wrote:

> Yes. However, looking at it, I'm not very happy with your patch. It
> doesn't really make sense to me to special-case the NULL inode and
> only do a seq_retry for that case.
>
> I kind of see why you do it for that particular bug, but at the same
> time, it just makes me go "Eww". If that inode isn't NULL yet, you
> then return the dentry that can get a NULL d_inode later. So the only
> special thing there is that we happen to check for a NULL inode there.
> What protects *later* checks for a NULL d_inode?
>
> So my gut feel is that we should instead
>
> - either remove the -ENOENT return at that point entirely, and move
> it to after we have re-verified the dentry lookup for other reasons.
> That looks pretty involved, though, and those paths do end up
> accessing inode data structures etc, so it looks less than trivial.
>
> OR
>
> - simply just not clear d_inode at all in d_kill(), so that when we
> prune a dentry due to memory pressure, it doesn't actually change the
> state of the dentry.

OR

- keep part of the patch from Hugh, treating negative in RCU mode as
"need to unlazy". Leaving everything else as-is. Normally we'll
confirm that sucker is negative and that'll be it. Or we'll see that
it's being evicted and fall back to non-lazy mode, which is what we'll
need to do anyway. IOW, this:

diff --git a/fs/namei.c b/fs/namei.c
index 5c867dd..48a38de 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1171,6 +1171,8 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
path->dentry = dentry;
if (unlikely(!__follow_mount_rcu(nd, path, inode)))
goto unlazy;
+ if (unlikely(!*inode))
+ goto unlazy;

if (unlikely(path->dentry->d_flags & DCACHE_NEED_AUTOMOUNT))
goto unlazy;

return 0;

Al Viro

unread,

Jul 17, 2011, 7:30:01 PM7/17/11

to

On Sun, Jul 17, 2011 at 03:59:26PM -0700, Linus Torvalds wrote:
> On Sun, Jul 17, 2011 at 3:00 PM, Linus Torvalds
> <torv...@linux-foundation.org> wrote:
> >
> > The above (whitespace-damaged) patch may look trivial, but it is
> > *entirely* untested, and maybe my gut feel that the above is the right
> > way to solve the problem is just wrong.
> >
> > Al, any reactions? Hugh, does the above patch work for your
> > stress-test case? Or, indeed, at all?
>
> Hmm, it worked for my small test, including under memory pressure that
> was shrinking dentries. But my test is not the kind of stress-test
> that you clearly have, so that only means that it has not *really*
> obvious breakages.

Let's not go there at the moment. I really don't want to pile on more
things needing audit that late in the cycle... I've posted what
I think is the minimal fix - two lines added in the lazy side of
do_lookup(); if Hugh can confirm that his breakage is gone with that,
I'd strongly suggest going with that one at the momemnt...

Speaking of things found late in cycle: today's catch so far is UFS
being buggered when exported over NFS (trivially fixed, d_splice_alias()
needed there) and cramfs calling d_add(dentry, ERR_PTR(-ENOMEM)) on
OOM... Plus yesterday cifs and ceph holes, plus exofs_get_parent()
oopsable due to confusion about calling conventions (it should return
ERR_PTR() on stale, not NULL).

Pending ones: lack of i_mutex in btrfs get_default_root() and bad misuse
of d_splice_alias() in there, *really* nasty one in nfs async sillyrename
(d_move() outside of i_mutex if unlink() goes sillyrename way and gets
SIGKILL while waiting for server to reply) and cifs_get_root() lacking
permission checks, i_mutex and calling d_materialise_unique() on possibly
hashed dentry. Plus a bunch of dubious places in ceph... ;-/ _And_
remaining fuckloads of places to wade through...

Linus Torvalds

unread,

Jul 17, 2011, 7:40:02 PM7/17/11

to

On Sun, Jul 17, 2011 at 4:16 PM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
>
> OR
>
> - keep part of the patch from Hugh, treating negative in RCU mode as
> "need to unlazy".

No, urgh, that's horrible.

Not being able to do an RCU lookup of negative dentries would be
really sad. There are some loads where a negative dentry is the
*common* case.

Linus

Hugh Dickins

unread,

Jul 17, 2011, 7:40:02 PM7/17/11

to

On Sun, 17 Jul 2011, Linus Torvalds wrote:
> On Sun, Jul 17, 2011 at 2:03 PM, Hugh Dickins <hu...@google.com> wrote:
> >
> > That -ENOENT in walk_component: isn't it assuming we found a negative
> > dentry, before reaching the read_seqcount_retry which complete_walk
> > (or nameidata_drop_rcu_last before 3.0) would use to confirm a successful
> > lookup?
>
> Hmm. I think you're right. The ENOENT will basically short-circuit the
> full proper checks.
>
> > And can't memory pressure prune a dentry, coming to dentry_kill
> > which __d_drops to unhash before dentry_iput resets d_inode to NULL, but
> > the dentry_rcuwalk_barrier between those is ineffective if the other end
> > ignores the seqcount?
>
> Yes. However, looking at it, I'm not very happy with your patch. It
> doesn't really make sense to me to special-case the NULL inode and
> only do a seq_retry for that case.
>
> I kind of see why you do it for that particular bug, but at the same
> time, it just makes me go "Eww". If that inode isn't NULL yet, you
> then return the dentry that can get a NULL d_inode later. So the only
> special thing there is that we happen to check for a NULL inode there.
> What protects *later* checks for a NULL d_inode?

I was imagining that all the later uses of the inode were using
walk_component()'s local struct inode *inode, or nd->inode which
it sets on success. Until complete_walk(), or the next level down
of lookup, has validated that stage by checking nd->seq.

If any does dereference dentry->d_inode in between, then it would
already be oopsing in this situation, which I've not seen.

>
> So my gut feel is that we should instead
>
> - either remove the -ENOENT return at that point entirely, and move
> it to after we have re-verified the dentry lookup for other reasons.
> That looks pretty involved, though, and those paths do end up
> accessing inode data structures etc, so it looks less than trivial.
>
> OR
>
> - simply just not clear d_inode at all in d_kill(), so that when we
> prune a dentry due to memory pressure, it doesn't actually change the
> state of the dentry.

But whether my imagining was right or wrong, your -1 line patch
looks a much nicer solution.

>
> and I think the second solution is the right one. It's kind of odd:
> we'll have called down to the iput() routine, and the inode will be
> "gone", but that is already true for the *normal* race of actually
> deleting a file too, and we have that whole "inodes are RCU-released",
> so the inode allocation will still exist.

Yes, the inodes are RCU-released too, so that side of it fits okay.

>
> So my gut feel is that we should instead just do this:
>
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -187,7 +187,6 @@ static void dentry_iput(struct dentry * dentry)
> {
> struct inode *inode = dentry->d_inode;
> if (inode) {
> - dentry->d_inode = NULL;
> list_del_init(&dentry->d_alias);
> spin_unlock(&dentry->d_lock);
> spin_unlock(&inode->i_lock);
>
> and see what the fall-out from that would be. Nobody should then *use*
> the stale inode, because __d_drop has done that
> dentry_rcuwalk_barrier(). So we avoid the NULL inode special case
> entirely.
>
> Comments?

At first it looked worrying to interfere with the sequence
"inode = dentry->d_inode; if (inode) { dentry->d_inode = NULL;"
but seeing as dentry_iput() is only called from the one place,
I think the test is merely about negative dentries, and setting
d_inode to NULL is just long-standing tidiness, nothing vital.

>
> The above (whitespace-damaged) patch may look trivial, but it is
> *entirely* untested, and maybe my gut feel that the above is the right
> way to solve the problem is just wrong.
>
> Al, any reactions? Hugh, does the above patch work for your
> stress-test case? Or, indeed, at all?

Well, my stress tests don't blow up in the first half hour.
But I was not successful in re-reproducing the issue (I got
yesterday in 2.5 hours) within 7 hours, so it's probably going
to take days to be sure that your minus-1-liner fixes it.

I certainly like your patch, though I'm not quite as confident
with it as I was with mine (until you pointed out its inconsistency).
Just a nagging doubt that leaving d_inode set may be wrong for somewhere.

Hugh

Hugh Dickins

unread,

Jul 17, 2011, 7:50:01 PM7/17/11

to

On Sun, 17 Jul 2011, Linus Torvalds wrote:

> On Sun, Jul 17, 2011 at 4:16 PM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
> >
> > OR
> >
> > - keep part of the patch from Hugh, treating negative in RCU mode as
> > "need to unlazy".
>
> No, urgh, that's horrible.
>
> Not being able to do an RCU lookup of negative dentries would be
> really sad. There are some loads where a negative dentry is the
> *common* case.

Yes, that worried me too. But I can see Al's point, it looks like
he has enough on his plate for now.

This is not a fresh regression: I vote we make no change for 3.0,
I give your minus-1-liner more testing, hopefully with confirmation,
and we get the fix into 3.0.1 (if that's what stable is to be called).
Or into 3.0 if there's going to be an -rc8, I was assuming not.

Hugh

Al Viro

unread,

Jul 17, 2011, 8:00:01 PM7/17/11

to

On Sun, Jul 17, 2011 at 04:38:24PM -0700, Linus Torvalds wrote:
> On Sun, Jul 17, 2011 at 4:16 PM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
> >
> > OR
> >

> > ?- keep part of the patch from Hugh, treating negative in RCU mode as

> > "need to unlazy".
>
> No, urgh, that's horrible.
>
> Not being able to do an RCU lookup of negative dentries would be
> really sad. There are some loads where a negative dentry is the
> *common* case.

No. Check the patch, please - what it does is exactly "if RCU lookup
in the middle of pathname gave us a negative dentry, check that it's
really negative ASAP".

Negative on the last component is not affected by that and there we
*do* go unlazy immediately anyway. If dentry in the middle of pathname
is really negative and not stale, we'll get unlazy_walk() check its
->d_seq and fall through the rest - all the way to failure exit in
walk_component(). Yes, we'll bump and drop ->d_count. On that
negative dentry in the middle.

On the other hand, if that sucker is stale, unlazy_walk() will check
->d_seq and bugger off. And no matter what we do, that pathname
resolution is going to have to be done in non-RCU mode at that point
and earlier is better.

Your variant could only lead to walking deeper into the tree before
we discover a stale dentry. Because it *can't* have non-stale
descendents anymore. Just more work for us...

IOW, once we run into negative dentry in RCU mode in do_lookup(), we
really need to drop out of RCU mode ASAP. Note that it's *NOT* about
not finding negative dentries in RCU dcache lookups or any such
silliness - that, of course, would be dumb.

Linus Torvalds

unread,

Jul 17, 2011, 8:00:02 PM7/17/11

to

On Sun, Jul 17, 2011 at 4:31 PM, Hugh Dickins <hu...@google.com> wrote:
>>
>> I kind of see why you do it for that particular bug, but at the same
>> time, it just makes me go "Eww". If that inode isn't NULL yet, you
>> then return the dentry that can get a NULL d_inode later. So the only
>> special thing there is that we happen to check for a NULL inode there.
>> What protects *later* checks for a NULL d_inode?
>
> I was imagining that all the later uses of the inode were using
> walk_component()'s local struct inode *inode, or nd->inode which
> it sets on success. Until complete_walk(), or the next level down
> of lookup, has validated that stage by checking nd->seq.

You are very likely right. The code *should* look up the inode
information before it actually does the final RCU check, since after
the sequence number check the window is open again.

So it's very possible that your patch is a real fix, and doesn't
really have any problems. It's just that I personally hate it ;)

So the reason I prefer my one-liner is not because it's smaller per se
- that's just a nice side issue. The reason I prefer my version is
that I think the current dentry pruning is actively wrong in turning a
dentry into a negative one due to memory pressure. So I think the name
lookup code is "correct", and the bug is literally that dentry pruning
(knowingly - there's even a comment about how the dentry is still
reachable for RCU lookups) turned a visible dentry into a negative one
even though nobody deleted it.

So I see your hack as being a workaround for the real bug, rather than a fix.

> If any does dereference dentry->d_inode in between, then it would
> already be oopsing in this situation, which I've not seen.

Good point, and it does probably mean that nobody accesses d_inode any more.

But another way of thinking about that is that the old lookup code
just created a cached copy of the d_inode pointer, so the code
literally always used a "stale" d_inode thing, and always relied on
the RCU freeing of inodes. So in a very real sense, my patch to just
remove the "d_inode = NULL" doesn't *change* anything. It just means
that now the dentry doesn't ever look negative.

So it should be a very safe patch, in that any stale inode pointer
issues are pre-existing, and not newly introduced issues by not
clearing the pointer.

> At first it looked worrying to interfere with the sequence
> "inode = dentry->d_inode; if (inode) { dentry->d_inode = NULL;"
> but seeing as dentry_iput() is only called from the one place,
> I think the test is merely about negative dentries, and setting
> d_inode to NULL is just long-standing tidiness, nothing vital.

Indeed. I checked that the "d_iput()" functions don't do anything with
d_inode either, and the only other thing we do afterwards is to
kfree() the dentry and the name (which is often RCU-delayed,. of
course, so there's that indirection through the RCU-freeing thing).

> Well, my stress tests don't blow up in the first half hour.
> But I was not successful in re-reproducing the issue (I got
> yesterday in 2.5 hours) within 7 hours, so it's probably going
> to take days to be sure that your minus-1-liner fixes it.

If you can keep running it, that would be good. I was *really* hoping
to do the final 3.0 tomorrow, but if we have a known thing to be
worried about, I guess I can delay it a bit.

> Just a nagging doubt that leaving d_inode set may be wrong for somewhere.

So see above: I don't think it's conceptually any different at all
from the RCU lookup code caching d_inode in nd->inode (or the local
*inode pointer).

But it's possible I'm missing something. As mentioned, I think my
one-liner is the more "correct" fix, but it's certainly more indirect
and perhaps a bit "subtler".

Al Viro

unread,

Jul 17, 2011, 8:30:01 PM7/17/11

to

On Sun, Jul 17, 2011 at 04:47:45PM -0700, Hugh Dickins wrote:
> On Sun, 17 Jul 2011, Linus Torvalds wrote:
> > On Sun, Jul 17, 2011 at 4:16 PM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
> > >
> > > OR
> > >

> > > ?- keep part of the patch from Hugh, treating negative in RCU mode as

> > > "need to unlazy".
> >
> > No, urgh, that's horrible.
> >
> > Not being able to do an RCU lookup of negative dentries would be
> > really sad. There are some loads where a negative dentry is the
> > *common* case.
>
> Yes, that worried me too.

Grr... What do you think we need to do when we find a negative dentry in
RCU lookup? We can
* check its ->d_seq, to see if it's valid
* somehow (e.g. with what Linus suggested) try to walk further and
leave staleness checks for dentries we would encounter later in the walk.

If it's in the end of pathname, there is no choice at all - we need to
check ->d_seq of this one. Right? And we do exactly that. If we see
that it's for real, fine we got ourselves a reference to negative dentry
and can go ahead. If it's something like stat(2), we'll just do dput()
and bugger off. Refcounts of everything on the path to it are unaffected,
no global locks played with...

If it's in the middle of pathname, we could try to delay the check. And
what would it buy us? If that dentry was stale, we'd just walk a bit
deeper. And found a stale one at some later point. And restarted pathname
resolution in non-lazy mode from the very beginning. If it wasn't stale
(i.e. real negative dentry), we *do* want -ENOENT and we'll get it as soon
as unlazy_walk() will check ->d_seq and we plod through the rest of
do_lookup() to check in walk_component(). We won't redo d_lookup(), lock
i_mutex, etc.

Hugh Dickins

unread,

Jul 17, 2011, 9:20:01 PM7/17/11

to

On Mon, 18 Jul 2011, Al Viro wrote:
> On Sun, Jul 17, 2011 at 04:47:45PM -0700, Hugh Dickins wrote:
> > On Sun, 17 Jul 2011, Linus Torvalds wrote:
> > > On Sun, Jul 17, 2011 at 4:16 PM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
> > > >
> > > > OR
> > > >
> > > > ?- keep part of the patch from Hugh, treating negative in RCU mode as
> > > > "need to unlazy".
> > >
> > > No, urgh, that's horrible.
> > >
> > > Not being able to do an RCU lookup of negative dentries would be
> > > really sad. There are some loads where a negative dentry is the
> > > *common* case.
> >
> > Yes, that worried me too.
>
> Grr... What do you think we need to do when we find a negative dentry in
> RCU lookup? We can
> * check its ->d_seq, to see if it's valid
> * somehow (e.g. with what Linus suggested) try to walk further and
> leave staleness checks for dentries we would encounter later in the walk.
>
> If it's in the end of pathname, there is no choice at all - we need to
> check ->d_seq of this one. Right?

Do we?

> And we do exactly that. If we see
> that it's for real, fine we got ourselves a reference to negative dentry

Where do we get a reference to the negative dentry? I thought
walk_component() was just terminating the walk and failing with -ENOENT.

> and can go ahead. If it's something like stat(2), we'll just do dput()
> and bugger off. Refcounts of everything on the path to it are unaffected,
> no global locks played with...
>
> If it's in the middle of pathname, we could try to delay the check. And
> what would it buy us? If that dentry was stale, we'd just walk a bit
> deeper. And found a stale one at some later point. And restarted pathname
> resolution in non-lazy mode from the very beginning. If it wasn't stale
> (i.e. real negative dentry), we *do* want -ENOENT and we'll get it as soon
> as unlazy_walk() will check ->d_seq and we plod through the rest of
> do_lookup() to check in walk_component(). We won't redo d_lookup(), lock
> i_mutex, etc.

Sorry, I still don't get it.

I can see that Linus's minus-1-liner to the pruning end might lead
do_lookup() one step deeper into the mire if there's a race, because
there's no NULL-d_inode advance warning; but I don't see that race
as a case we need to optimize for.

And in the genuine negative dentry case: at present (with or without
my or Linus's patches) there is no final nd->seq check, is there?

Having found a negative dentry, we just get out at that point with
-ENOENT. Now, it's true that there might be a race in which the
negative dentry is being made positive just as we're getting out;
but since we're not digging in any deeper, that seems to me just
a natural inevitable race, nothing to be guarded against.

Whereas your "if (!*inode) goto unlazy;" is adding an additional
sequence check, isn't it? Is that a race you see as important?
If so, it is a different matter than I was ever considering.

Or am I looking in the wrong place, focussing on walk_component()
where my -ENOENT came from, and should be looking somewhere else?
I've not looked to see where a negative dentry is made positive.

And your two patches are not mutually exclusive: we could use
both of them together if there's good reason (I like Linus's
argument that pruning is wrong to make dentry appear negative).

Hugh

p.s. As ever, there may be better uses of your time, than trying
to get your points into my skull: if Linus is satisfied by your
argument, please just ignore me!

Al Viro

unread,

Jul 17, 2011, 10:10:02 PM7/17/11

to

On Sun, Jul 17, 2011 at 06:13:34PM -0700, Hugh Dickins wrote:

> And in the genuine negative dentry case: at present (with or without
> my or Linus's patches) there is no final nd->seq check, is there?

For stat() - no, there isn't. We really bail out with -ENOENT, no matter
what. Racy, which is what you are hitting.

> Having found a negative dentry, we just get out at that point with
> -ENOENT. Now, it's true that there might be a race in which the
> negative dentry is being made positive just as we're getting out;
> but since we're not digging in any deeper, that seems to me just
> a natural inevitable race, nothing to be guarded against.
>
> Whereas your "if (!*inode) goto unlazy;" is adding an additional
> sequence check, isn't it? Is that a race you see as important?
> If so, it is a different matter than I was ever considering.

Umm... I really don't think that trying to avoid that ->d_seq comparison
is worth doing. I understand your point, but... IMO it's safer to leave
->d_inode cleaning as-is, at least for now. Hell knows; I'd be a lot
more optimistic if fs/dcache.c had been in better shape. As it is, *and*
considering that we are within a spitting distance from -final...

It's not as if we'd been talking about a heavy contention or cacheline
bouncing here, let alone doing ->lookup(). I'm not entirely convinced
that it's a valid optimization in the first place (probably is, but I'm
seriously scared by the complexity we already have there), and I'm really
not fond of the idea of dealing with whatever subtle crap we might
discover with Linus' patch. Again, dcache is not in a healthy shape
right now; at this point dumb and straightforward is, IMO, better than
subtle and risking to step on toes of very odd code out there.

Hell, stuff that walks towards root with just rcu_read_lock, checking
->d_inode as it goes and making assumptions about the lifecycle
stages of the dentries it sees is not even all that weird, compared to
the code actually in there ;-/

Once we are done with code audit, sure, I'm fine with ->d_inode being
kept until dentry is actually freed. Any code that relies on that thing
being cleared is asking for trouble and should be rewritten anyway.
The only thing is, it needs to be found before we rewrite it...

Speaking of weird code - could somebody explain what the deuce is going
on in dget_parent()?
rcu_read_lock();
ret = dentry->d_parent;
if (!ret) {
rcu_read_unlock();
goto out;
}
in the very beginning looks wrong - we *never* have NULL ->d_parent on
dentries that might be seen by anyone; d_alloc(NULL, name) callers all
set ->d_parent to dentry itself ASAP and we never put NULL there afterwards.

What the hell is going on there? Note that callers of dget_parent() will
be really unimpressed by getting NULL from it; most of them will cheerfully
dereference what they get. It seems to be Nick's change; is there anybody
who would remember reasons for it?

Linus Torvalds

unread,

Jul 18, 2011, 2:40:01 AM7/18/11

to

On Sun, Jul 17, 2011 at 7:08 PM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
> On Sun, Jul 17, 2011 at 06:13:34PM -0700, Hugh Dickins wrote:
>
>> And in the genuine negative dentry case: at present (with or without
>> my or Linus's patches) there is no final nd->seq check, is there?
>
> For stat() - no, there isn't. We really bail out with -ENOENT, no matter
> what. Racy, which is what you are hitting.

The fundamental bug is in the dentry pruning. Not in lookup. If the
lookup sees a negative dentry, it will currently return ENOENT. Not
unlazy the thing like you claim. Your patch makes it unlazy the RCU
walk for a negative dentry and then retry that it currently doesn't -
for both the racy case (nobody cares, since that isn't performance
critical) _and_ for the normal real negative case.

And it's that "normal real negative case" that I really think that
your "goto unlazy" is just wrong. It's right for the sequence number
change, but not for the "!inode" case.

Now, I do agree that maybe that case simply should check the dentry
sequence count. I wish all cases did. Hugh patch did that. But the
reason I dislike Hugh's patch is that when I say "I wish they all
did", I mean that I dislike the special casing. And Hugh's patch just
adds *more* special casing for that NULL entry - I'd wish we just
always did it regardless of whether it was NULL or not.

So I claim that your patch changes *way* more than mine does. I think
I'd prefer Hugh's over that one. I still think that right now, my
one-liner is actually the one that changes things the least (and I
don't mean in size of patch, but in behavior and logic).

Linus

Hugh Dickins

unread,

Jul 18, 2011, 10:50:01 AM7/18/11

to

On Sun, 17 Jul 2011, Linus Torvalds wrote:
>

> So I claim that your patch changes *way* more than mine does. I think
> I'd prefer Hugh's over that one. I still think that right now, my
> one-liner is actually the one that changes things the least (and I
> don't mean in size of patch, but in behavior and logic).

A quick audit of .d_iputs suggests that they're all okay with yours;
but it only needs one of them to say BUG_ON(dentry->d_inode) for no
good reason, and the fix becomes much more dangerous than the tiny
race it's trying to fix.

I've entirely failed to reproduce the issue since I sent you the patch:
the last time I saw it was a couple of hours before sending the patch,
and that was the first time in weeks (partly because of vacation).
It's now looking as if I was just lucky to get it in hours that time.

I'm thinking there's some other factor, which I know nothing about,
which opens the window to make this issue more than theoretical.
I may be forced to opening the window artificially to get it to
happen often enough to even begin to verify that your patch,
or Al's, or mine, does not blow up in some unforeseen way.
And I'm unlikely to be testing filesystem dentry_iputs at all.

Your patch remains my preference, but given Al's uneasiness,
my vote remains to make no change now for 3.0, but put yours
in early for 3.1-rc.

Hugh

Linus Torvalds

unread,

Jul 18, 2011, 2:20:01 PM7/18/11

to

On Sun, Jul 17, 2011 at 11:31 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
>
> Now, I do agree that maybe that case simply should check the dentry
> sequence count. I wish all cases did. Hugh patch did that. But the
> reason I dislike Hugh's patch is that when I say "I wish they all
> did", I mean that I dislike the special casing. And Hugh's patch just
> adds *more* special casing for that NULL entry - I'd wish we just
> always did it regardless of whether it was NULL or not.

Btw, looking at that, I think Hugh's patch is wrong. It does

if (!read_seqcount_retry(&dentry->d_seq, nd->seq))

but that's after we've done the __follow_mount_rcu() that may actually
have changed "nd->seq" to the mount-point inode (and has changed
path->dentry to match it).

Now, it only does it if inode is NULL, so I guess it doesn't matter,
but it's the kind of inconsistency that I think is really dangerous,
because it basically compares incompatible sequence numbers.

Also, looking at that whole mount-point traversal sequence, it looks
like __follow_mount_rcu() will happily totally ignore the old sequence
number when it replaces it with the mount-point sequence number. So it
looks to me that we have a case where we miss the sequence number
check that can happen with a positive dentry too!

No?

So I think that whenever we change "nd->seq", we should always heck
the previous sequence number first (the way do_lookup() itself does
for the *normal* traversal case). Otherwise we will have traversed the
mount-point without ever having checked the previous sequence number.

Something like the (untested) attached patch.

Comments? This mount-point case is independent of the negative dentry
issue, and probably never really an issue in practice, but...

Linus

mount-sequence.diff

Al Viro

unread,

Jul 18, 2011, 2:30:02 PM7/18/11

to

On Mon, Jul 18, 2011 at 11:11:40AM -0700, Linus Torvalds wrote:

> Also, looking at that whole mount-point traversal sequence, it looks
> like __follow_mount_rcu() will happily totally ignore the old sequence
> number when it replaces it with the mount-point sequence number. So it
> looks to me that we have a case where we miss the sequence number
> check that can happen with a positive dentry too!
>
> No?
>
> So I think that whenever we change "nd->seq", we should always heck
> the previous sequence number first (the way do_lookup() itself does
> for the *normal* traversal case). Otherwise we will have traversed the
> mount-point without ever having checked the previous sequence number.
>
> Something like the (untested) attached patch.
>
> Comments? This mount-point case is independent of the negative dentry
> issue, and probably never really an issue in practice, but...

->mnt_mountpoint and ->mnt_root are both pinned (and protected by
vfsmount_lock, while we are at it). If it manages to get stale,
we have worse problems...

Linus Torvalds

unread,

Jul 18, 2011, 3:10:02 PM7/18/11

to

On Mon, Jul 18, 2011 at 11:20 AM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
>
> ->mnt_mountpoint and ->mnt_root are both pinned (and protected by
> vfsmount_lock, while we are at it). If it manages to get stale,
> we have worse problems...

I'm not worried about the inode itself being stale as much as the
*name* being stale, ie the whole "what happens if dentry->d_name is
being changed concurrently with lookup" issue.

But I do agree that it's very unlikely to actually cause problems.
People just don't move mount points around, and the race would seem to
be ridiculously hard to ever hit even if you tried to create some
totally artificial load that does nothing but rename mount-points ;)

That said, the reason I think we should do this is that (a) it's
really cheap to check the sequence numbers and (b) we should just make
it a rule that whenever you set up the next sequence number for
lookup, you should check the previous one (and do it in the right
order - you need to check the previous one *after* you've done the
read_seqcont_begin() for the next one, so that the sequence number
checks always "overlap").

So the patch is essentially a free "fix things to follow the general rules".

But I think I'm on the same page with Hugh that at this point none of
these issues are even remotely recent, they are basically impossible
to hit, and the fixes are likely potentially more dangerous than the
bugs they fix, so I'll release 3.0 without them and just apply the
fixes (and cc stable) afterwards.

Linus

Hugh Dickins

unread,

Jul 18, 2011, 3:10:03 PM7/18/11

to

On Mon, 18 Jul 2011, Linus Torvalds wrote:
> On Sun, Jul 17, 2011 at 11:31 PM, Linus Torvalds
> <torv...@linux-foundation.org> wrote:
> >
> > Now, I do agree that maybe that case simply should check the dentry
> > sequence count. I wish all cases did. Hugh patch did that. But the
> > reason I dislike Hugh's patch is that when I say "I wish they all
> > did", I mean that I dislike the special casing. And Hugh's patch just
> > adds *more* special casing for that NULL entry - I'd wish we just
> > always did it regardless of whether it was NULL or not.
>
> Btw, looking at that, I think Hugh's patch is wrong. It does
>
> if (!read_seqcount_retry(&dentry->d_seq, nd->seq))
>
> but that's after we've done the __follow_mount_rcu() that may actually
> have changed "nd->seq" to the mount-point inode (and has changed
> path->dentry to match it).

Yes, my patch is wrong there. I started out with
if (!read_seqcount_retry(&dentry->d_seq, seq);
but seeing __follow_mount_rcu() updates inode and nd->seq, I changed to
if (!read_seqcount_retry(&dentry->d_seq, nd->seq);
missing the obvious, that it's changing path->dentry when it updates nd->seq.

>
> Now, it only does it if inode is NULL, so I guess it doesn't matter,
> but it's the kind of inconsistency that I think is really dangerous,
> because it basically compares incompatible sequence numbers.

Yes, it's simply wrong.

>
> Also, looking at that whole mount-point traversal sequence, it looks
> like __follow_mount_rcu() will happily totally ignore the old sequence
> number when it replaces it with the mount-point sequence number. So it
> looks to me that we have a case where we miss the sequence number
> check that can happen with a positive dentry too!

Al has commented on that. I'd feel more confident with a patch like
yours, and corrected final check in mine (if we used mine at all)
if (!read_seqcount_retry(&path->dentry->d_seq, nd->seq);
but ignore me, I'm easily confused by mounts ;)

Hugh

Al Viro

unread,

Jul 18, 2011, 3:30:02 PM7/18/11

to

On Mon, Jul 18, 2011 at 12:08:04PM -0700, Linus Torvalds wrote:

> But I do agree that it's very unlikely to actually cause problems.
> People just don't move mount points around, and the race would seem to
> be ridiculously hard to ever hit even if you tried to create some
> totally artificial load that does nothing but rename mount-points ;)

vfs_rename() will refuse to move mountpoints. Required by POSIX, among
other things...

> But I think I'm on the same page with Hugh that at this point none of
> these issues are even remotely recent, they are basically impossible
> to hit, and the fixes are likely potentially more dangerous than the
> bugs they fix, so I'll release 3.0 without them and just apply the
> fixes (and cc stable) afterwards.

Hmm... Could you pull #for-linus (or #for-linus2, for that matter - same
plus cifs_get_root() fix) first?

Al Viro

unread,

Jul 18, 2011, 3:30:02 PM7/18/11

to

On Mon, Jul 18, 2011 at 08:20:38PM +0100, Al Viro wrote:

> Hmm... Could you pull #for-linus (or #for-linus2, for that matter - same
> plus cifs_get_root() fix) first?

FWIW, diffstat and changelog for #for-linus2:
fs/ceph/mds_client.c | 19 ++++++++-
fs/cifs/cifsfs.c | 100 ++++++++++++++-----------------------------------
fs/cifs/dir.c | 13 ++++++-
fs/cramfs/inode.c | 22 ++++++-----
fs/exofs/super.c | 2 +-
fs/hppfs/hppfs.c | 31 ++++-----------
fs/ufs/namei.c | 12 ++----
7 files changed, 83 insertions(+), 116 deletions(-)
Al Viro (8):
cifs: build_path_from_dentry() race fix
ceph analog of cifs build_path_from_dentry() race fix
fix exofs ->get_parent()
ufs should use d_splice_alias()
cramfs: get_cramfs_inode() returns ERR_PTR() on failure
hppfs: fix dentry leak
hppfs_lookup(): don't open-code lookup_one_len()
Fix cifs_get_root()

Linus Torvalds

unread,

Jul 18, 2011, 3:40:01 PM7/18/11

to

On Mon, Jul 18, 2011 at 12:04 PM, Hugh Dickins <hu...@google.com> wrote:
>
> Al has commented on that. I'd feel more confident with a patch like
> yours, and corrected final check in mine (if we used mine at all)
> if (!read_seqcount_retry(&path->dentry->d_seq, nd->seq);
> but ignore me, I'm easily confused by mounts ;)

No, I do think I want to do it. So for post-3.0 (likely today ;), my
current plan is to have three commits in this area:

- the one-liner removal of d_inode = NULL in d_kill (but I'll add a comment)

- the patch to properly do that dentry sequence number switch-over
across the mount-point traversal

- And possibly *some* version of your patch.

However, I'm actually looking at __d_lookup_rcu(), and it does the proper

if (read_seqcount_retry(&dentry->d_seq, *seq))
goto seqretry;

*after* having pulled the inode pointer out of "dentry->d_inode". So
I'm not actually seeing how the inode pointer could be stale at all
right now, because we do seem to have the sequence number check there!

Which just makes me suspect that it's *another* case of bug in the
dentry_kill() sequence. I think the dentry_rcuwalk_barrier() thing is
crap: it doesn't *cover* the d_inode change with the write lock, it
just puts that "barrier" there, which is bogus. You can get a
successful read of the changed inode without ever failing the read
lock sequence!

Linus

Linus Torvalds

unread,

Jul 18, 2011, 3:40:02 PM7/18/11

to

On Mon, Jul 18, 2011 at 12:20 PM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
>
> Hmm... Could you pull #for-linus (or #for-linus2, for that matter - same
> plus cifs_get_root() fix) first?

I'd already done that, it just wasn't pushed out to master yet. Out now.

Except at the time I pulled, it was just the older 'for_linus' branch.

Linus

Al Viro

unread,

Jul 18, 2011, 3:50:01 PM7/18/11

to

On Mon, Jul 18, 2011 at 12:33:28PM -0700, Linus Torvalds wrote:

> Which just makes me suspect that it's *another* case of bug in the
> dentry_kill() sequence. I think the dentry_rcuwalk_barrier() thing is
> crap: it doesn't *cover* the d_inode change with the write lock, it
> just puts that "barrier" there, which is bogus. You can get a
> successful read of the changed inode without ever failing the read
> lock sequence!

Huh? We do __d_drop() in there, and do that before we start messing
with ->d_inode. And barrier here will make sure that anything that
might have found this sucker prior to that __d_drop() will fail
->d_seq check - it bumps ->d_seq by 2...

As long as ->d_inode change happens under ->d_lock and ->d_seq is
bumped along with that change, we should be fine, no?

Linus Torvalds

unread,

Jul 18, 2011, 4:30:02 PM7/18/11

to

On Mon, Jul 18, 2011 at 12:47 PM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
>
> Huh? We do __d_drop() in there, and do that before we start messing
> with ->d_inode.

Hmm. Yes, looking at it, the ordering all seems correct. But then what
did Hugh see at all?

The inode thing he got from d_inode is re-verified by
__d_lookup_rcu(). So if inode is NULL, that means that the other CPU
has done dentry_iput(), which means that __d_drop has already
happened, which means that the dentry has been removed from the hash
list *and* the count has been incremented.

So just judging by Hugh's thing, something is wrong there. Some
missing barrier, perhaps. But write_seqcount_barrier() does seem to
have the write barrier, and the __d_lookup_rcu() read barrier check is
using the proper "full" read barrier too.

So in order for __d_lookup_rcu() to return a NULL inode, we have the
following requirements:
- it needs to find the dentry (duh)
- the dentry sequence count gets read (proper read barrier after this)
- the dentry must still look hashed
- the dentry->d_inode must be NULL
- and the dentry sequence count gets re-read (proper read barrier
before this) and must match.

And right now I don't see how that can happen, exactly because
__d_lookup_rcu does the sequence point check. But that's what Hugh's
patch depends on: seeing a NULL d_inode race.

If we see d_inode being NULL, that means that the first sequence
number read must happen *after* we've set d_inode to NULL in
dentry_iput(), which happens *after* we've done the sequence number
increment. That part is fine: that means that the sequence numbers
will match (because both of them are the "later" one). No
inconsistency so far. d_inode being NULL is perfectly compatible with
no sequence number change, and that was what I thought was a bug in
this area at first.

But if the first sequence number read (the read_seqcount_begin() in
__d_lookup_rcu()) sees the later sequence number, that means that
__d_drop has already happened, and it should *also* see the
dentry->d_hash.pprev = NULL; that happened in __d_drop before the
dentry_rcuwalk_barrier(). We have both the read barrier in the
read_seqcount_begin() and the write barrier in the
dentry_rcuwalk_barrier() to guarantee that.

But if the __d_lookup_rcu() sees that, then the d_unhashed() should
have seen the NULL pprev pointer, and not ever returned the dentry,
because that __d_lookup_rcu() loop does

if (d_unhashed(dentry))
continue;

after reading the sequence count.

So how could Hugh's NULL inode ever happen in the first place? Even
with the current sources? It all looks solid to me now that I look at
all the details.

Linus

Hugh Dickins

unread,

Jul 18, 2011, 5:30:01 PM7/18/11

to

On Mon, 18 Jul 2011, Linus Torvalds wrote:

> On Mon, Jul 18, 2011 at 12:47 PM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
> >
> > Huh? We do __d_drop() in there, and do that before we start messing
> > with ->d_inode.
>
> Hmm. Yes, looking at it, the ordering all seems correct. But then what
> did Hugh see at all?
>
> The inode thing he got from d_inode is re-verified by
> __d_lookup_rcu(). So if inode is NULL, that means that the other CPU
> has done dentry_iput(), which means that __d_drop has already
> happened, which means that the dentry has been removed from the hash
> list *and* the count has been incremented.

__d_lookup_rcu() is being careful about *inode, yes.

But I'd forgotten it was even setting it: doesn't that setting get
overridden later by the more careless *inode = path->d_entry->d_inode
at the head of __follow_mount_rcu()'s loop?

Perhaps that line just needs to be moved to the tail of the loop?

Hugh

Linus Torvalds

unread,

Jul 18, 2011, 5:50:02 PM7/18/11

to

On Mon, Jul 18, 2011 at 2:19 PM, Hugh Dickins <hu...@google.com> wrote:
>
> __d_lookup_rcu() is being careful about *inode, yes.
>
> But I'd forgotten it was even setting it: doesn't that setting get
> overridden later by the more careless *inode = path->d_entry->d_inode
> at the head of __follow_mount_rcu()'s loop?
>
> Perhaps that line just needs to be moved to the tail of the loop?

Ahh. Bingo. Yes, I think you found it.

I don't think it should touch that *inode value in
__follow_mount_rcu() unless we actually followed a mount, exactly
because it will overwrite the thing that we were so careful about in
__d_lookup_rcu().

So how about this patch that replaces the earlier mount-point sequence
number one. The only difference is (as you mention) to just do the
*inode update at the end of the loop, so that we don't overwrite the
valid inode data with a non-checked one when we don't do anything.

Untested. But this should make my propised change to fs/dcache.c be
irrelevant, because whether we clear d_inode or not, the existing
sequence number checks will catch it. Agreed?

Linus

mount-sequence.diff

Hugh Dickins

unread,

Jul 18, 2011, 6:50:01 PM7/18/11

to

On Mon, 18 Jul 2011, Linus Torvalds wrote:

Yes, that looks like THE right fix to me, making others irrelevant.
But I haven't tested it either, won't begin to do so until tonight.
And I fear I won't notice if there's any silly little bug in there,
which forces every fast RCU lookup to go the slow way instead!

Inlined for the benefit of wider scrutiny:

fs/namei.c | 21 +++++++++++++++++++--
1 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 5c867dd1c0b3..b9cd558a00be 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -938,11 +938,12 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
{
for (;;) {
struct vfsmount *mounted;
+ unsigned int seq;
+
/*
* Don't forget we might have a non-mountpoint managed dentry
* that wants to block transit.
*/
- *inode = path->dentry->d_inode;
if (unlikely(managed_dentry_might_block(path->dentry)))
return false;

@@ -952,9 +953,25 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
mounted = __lookup_mnt(path->mnt, path->dentry, 1);
if (!mounted)
break;
+ seq = read_seqcount_begin(&mounted->mnt_root->d_seq);
+
+ /*
+ * The memory barrier in read_seqcount_begin() is sufficient,
+ * so we can use __read_seqcount_retry() to check the prev
+ * sequence numbers.
+ */
+ if (!__read_seqcount_retry(&path->dentry->d_seq, nd->seq))
+ return false;
path->mnt = mounted;
path->dentry = mounted->mnt_root;
- nd->seq = read_seqcount_begin(&path->dentry->d_seq);
+ nd->seq = seq;
+
+ /*
+ * Update the inode too. We don't need to re-check the
+ * dentry sequence number here after this d_inode read,
+ * because a mount-point is always pinned.
+ */
+ *inode = path->dentry->d_inode;
}
return true;

Al Viro

unread,

Jul 18, 2011, 7:20:01 PM7/18/11

to

On Mon, Jul 18, 2011 at 03:43:29PM -0700, Hugh Dickins wrote:
> Yes, that looks like THE right fix to me, making others irrelevant.
> But I haven't tested it either, won't begin to do so until tonight.
> And I fear I won't notice if there's any silly little bug in there,
> which forces every fast RCU lookup to go the slow way instead!
>
> Inlined for the benefit of wider scrutiny:

Looks sane. Applied. I'll wait with pull request until you test it,
but that looks right.

Al Viro

unread,

Jul 18, 2011, 7:30:02 PM7/18/11

to

On Tue, Jul 19, 2011 at 12:17:38AM +0100, Al Viro wrote:
> On Mon, Jul 18, 2011 at 03:43:29PM -0700, Hugh Dickins wrote:
> > Yes, that looks like THE right fix to me, making others irrelevant.
> > But I haven't tested it either, won't begin to do so until tonight.
> > And I fear I won't notice if there's any silly little bug in there,
> > which forces every fast RCU lookup to go the slow way instead!
> >
> > Inlined for the benefit of wider scrutiny:
>
> Looks sane. Applied. I'll wait with pull request until you test it,
> but that looks right.

Linus, are you OK with me slapping your s-o-b on it?

Linus Torvalds

unread,

Jul 18, 2011, 7:30:02 PM7/18/11

to

On Mon, Jul 18, 2011 at 4:21 PM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
>
> Linus, are you OK with me slapping your s-o-b on it?

Yup.

And it feels so right that I think we should just do it for 3.0.

I was planning on doing that today, but I guess can delay it too. I'm
at the Intel tech days tomorrow and the day after, but there's WiFi
there and I'll have my laptop. The slightly bigger problem is that it
will move the next merge window even further into my upcoming
vacation, but I was planning on having the laptop with me there too,
so whatever...

I assume I'll get the cifs_get_root() fix at the same time.

Linus

Al Viro

unread,

Jul 18, 2011, 7:50:02 PM7/18/11

to

On Mon, Jul 18, 2011 at 04:27:53PM -0700, Linus Torvalds wrote:
> On Mon, Jul 18, 2011 at 4:21 PM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
> >
> > Linus, are you OK with me slapping your s-o-b on it?
>
> Yup.
>
> And it feels so right that I think we should just do it for 3.0.
>
> I was planning on doing that today, but I guess can delay it too. I'm
> at the Intel tech days tomorrow and the day after, but there's WiFi
> there and I'll have my laptop. The slightly bigger problem is that it
> will move the next merge window even further into my upcoming
> vacation, but I was planning on having the laptop with me there too,
> so whatever...
>
> I assume I'll get the cifs_get_root() fix at the same time.

Yes. Maybe btrfs get_default_root() one as well, if it survives beating;
also a missing i_mutex, along with seriously wrong use of d_splice_alias()
(it ignores the return value, for starters, and assumes that inode will
not be found in icache once all dentries pointing to it are off the alias
list)

Hugh Dickins

unread,

Jul 18, 2011, 10:10:01 PM7/18/11

to

On Tue, 19 Jul 2011, Al Viro wrote:
> On Mon, Jul 18, 2011 at 04:27:53PM -0700, Linus Torvalds wrote:
> > On Mon, Jul 18, 2011 at 4:21 PM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
> > >
> > > Linus, are you OK with me slapping your s-o-b on it?
> >
> > Yup.
> >
> > And it feels so right that I think we should just do it for 3.0.

It looked sane, but in fact was insane. Enough testing for now,
I say it's good to go, if you fold in this small adjustment...

[PATCH] fix sense of __read_seqcount_retry test

Hah, good thing I hacked something in to check we're going the quick
way: __read_seqcount_retry() returns *true* when a retry is needed.

Signed-off-by: Hugh Dickins <hu...@google.com>
---
fs/namei.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- 3.0-git+/fs/namei.c 2011-07-18 18:31:36.273731128 -0700
+++ linux/fs/namei.c 2011-07-18 18:32:35.158022444 -0700
@@ -960,7 +960,7 @@ static bool __follow_mount_rcu(struct na

* so we can use __read_seqcount_retry() to check the prev

* sequence numbers.
*/
- if (!__read_seqcount_retry(&path->dentry->d_seq, nd->seq))
+ if (__read_seqcount_retry(&path->dentry->d_seq, nd->seq))

return false;
path->mnt = mounted;
path->dentry = mounted->mnt_root;

Linus Torvalds

unread,

Jul 18, 2011, 10:20:01 PM7/18/11

to

On Mon, Jul 18, 2011 at 7:07 PM, Hugh Dickins <hu...@google.com> wrote:
>
> It looked sane, but in fact was insane. Enough testing for now,
> I say it's good to go, if you fold in this small adjustment...

Good catch, that test was indeed the wrong way around. Thanks.

Linus

Linus Torvalds

unread,

Jul 18, 2011, 10:20:01 PM7/18/11

to

On Mon, Jul 18, 2011 at 7:14 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
>
> Good catch, that test was indeed the wrong way around. Thanks.

Oh, and Al - just fold the thing into the original patch rather than
carry a buggy patch and the fix around. You might as well credit Hugh
as author for the whole thing, since he ended up being not just the
one to find the problem, but describe the solution anyway and fix the
patch.

Al Viro

unread,

Jul 18, 2011, 10:30:01 PM7/18/11

to

On Mon, Jul 18, 2011 at 07:07:54PM -0700, Hugh Dickins wrote:

> It looked sane, but in fact was insane. Enough testing for now,
> I say it's good to go, if you fold in this small adjustment...

Folded and pushed...

Al Viro

unread,

Jul 18, 2011, 10:30:01 PM7/18/11

to

On Mon, Jul 18, 2011 at 07:17:18PM -0700, Linus Torvalds wrote:
> On Mon, Jul 18, 2011 at 7:14 PM, Linus Torvalds
> <torv...@linux-foundation.org> wrote:
> >
> > Good catch, that test was indeed the wrong way around. Thanks.
>
> Oh, and Al - just fold the thing into the original patch rather than
> carry a buggy patch and the fix around. You might as well credit Hugh
> as author for the whole thing, since he ended up being not just the
> one to find the problem, but describe the solution anyway and fix the
> patch.

Umm... Is there any convenient way to make git commit --amend change
the author of commit?

Chris Ball

unread,

Jul 18, 2011, 10:40:02 PM7/18/11

to

Hi,

On Mon, Jul 18 2011, Al Viro wrote:
>> Oh, and Al - just fold the thing into the original patch rather than
>> carry a buggy patch and the fix around. You might as well credit Hugh
>> as author for the whole thing, since he ended up being not just the
>> one to find the problem, but describe the solution anyway and fix the
>> patch.
>
> Umm... Is there any convenient way to make git commit --amend change
> the author of commit?

`git commit --amend --author="Foo Bar <foo@bar>"` works fine.

--
Chris Ball <c...@laptop.org> <http://printf.net/>
One Laptop Per Child

Nicolas Pitre

unread,

Jul 19, 2011, 12:50:02 AM7/19/11

to

On Tue, 19 Jul 2011, Al Viro wrote:

> On Mon, Jul 18, 2011 at 07:17:18PM -0700, Linus Torvalds wrote:
> > On Mon, Jul 18, 2011 at 7:14 PM, Linus Torvalds
> > <torv...@linux-foundation.org> wrote:
> > >
> > > Good catch, that test was indeed the wrong way around. Thanks.
> >
> > Oh, and Al - just fold the thing into the original patch rather than
> > carry a buggy patch and the fix around. You might as well credit Hugh
> > as author for the whole thing, since he ended up being not just the
> > one to find the problem, but describe the solution anyway and fix the
> > patch.
>
> Umm... Is there any convenient way to make git commit --amend change
> the author of commit?

git commit --amend --author="Hugh Dickins <hu...@google.com>"

Nicolas

Al Viro

unread,

Jul 19, 2011, 7:50:01 PM7/19/11

to

You know what... I doubt that you want to mess with ->d_seq checks here.
It's definitely not Hugh's bug (unless he has bindings somewhere odd) and
both ->mnt_mountpoint and ->mnt_root are pinned (and we are holding
vfsmount_lock anyway). *inode assignment too early is a real bug, indeed,
and we want to assign nd->seq if we cross mountpoint as both versions do,
but check just before that is, in the best case, BUG_ON() fodder. We'd
just found a vfsmount with ->mnt_mountpoint equal to path->dentry; it *can't*
be stale, or we have a really nasty problem anyway.

Al Viro

unread,

Jul 19, 2011, 8:00:01 PM7/19/11

to

On Wed, Jul 20, 2011 at 12:45:51AM +0100, Al Viro wrote:

> You know what... I doubt that you want to mess with ->d_seq checks here.
> It's definitely not Hugh's bug (unless he has bindings somewhere odd) and
> both ->mnt_mountpoint and ->mnt_root are pinned (and we are holding
> vfsmount_lock anyway). *inode assignment too early is a real bug, indeed,
> and we want to assign nd->seq if we cross mountpoint as both versions do,
> but check just before that is, in the best case, BUG_ON() fodder. We'd
> just found a vfsmount with ->mnt_mountpoint equal to path->dentry; it *can't*
> be stale, or we have a really nasty problem anyway.

Kudos to neilb for spotting the pointless check, BTW; and no, his theory
that it might be needed since we could race with umount() is wrong - due to
vfsmount_lock being held.

Linus Torvalds

unread,

Jul 19, 2011, 8:00:01 PM7/19/11

to

On Tue, Jul 19, 2011 at 4:45 PM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
>
> You know what... �I doubt that you want to mess with ->d_seq checks here.

I'm ok with doing just the "move *inode setting" patch.

And if we really don't even allow renames of those mounted things, I
guess the sequence number isn't required.

Linus

Al Viro

unread,

Jul 19, 2011, 8:00:02 PM7/19/11

to

On Wed, Jul 20, 2011 at 12:52:16AM +0100, Al Viro wrote:
> On Wed, Jul 20, 2011 at 12:45:51AM +0100, Al Viro wrote:
>
> > You know what... I doubt that you want to mess with ->d_seq checks here.
> > It's definitely not Hugh's bug (unless he has bindings somewhere odd) and
> > both ->mnt_mountpoint and ->mnt_root are pinned (and we are holding
> > vfsmount_lock anyway). *inode assignment too early is a real bug, indeed,
> > and we want to assign nd->seq if we cross mountpoint as both versions do,
> > but check just before that is, in the best case, BUG_ON() fodder. We'd
> > just found a vfsmount with ->mnt_mountpoint equal to path->dentry; it *can't*
> > be stale, or we have a really nasty problem anyway.
>
> Kudos to neilb for spotting the pointless check, BTW; and no, his theory
> that it might be needed since we could race with umount() is wrong - due to
> vfsmount_lock being held.

Updated, pushed. Please, pull from the usual place (
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6.git/ for-linus
).

Shortlog:
Al Viro (1):
Fix cifs_get_root()

Linus Torvalds (1):
vfs: fix race in rcu lookup of pruned dentry

Diffstat:
fs/cifs/cifsfs.c | 100 +++++++++++++++--------------------------------------
fs/namei.c | 10 +++++-
2 files changed, 38 insertions(+), 72 deletions(-)

Al Viro

unread,

Jul 19, 2011, 8:10:02 PM7/19/11

to

On Tue, Jul 19, 2011 at 04:56:51PM -0700, Linus Torvalds wrote:
> On Tue, Jul 19, 2011 at 4:45 PM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
> >

> > You know what... ?I doubt that you want to mess with ->d_seq checks here.

>
> I'm ok with doing just the "move *inode setting" patch.
>
> And if we really don't even allow renames of those mounted things, I
> guess the sequence number isn't required.

See vfs_rename_{dir,other}:

error = -EBUSY;
if (d_mountpoint(old_dentry) || d_mountpoint(new_dentry))
goto out;

in both, before even trying to call the method. Traditional on all
Unices I've ever seen...

NeilBrown

unread,

Jul 19, 2011, 8:50:01 PM7/19/11

to

On Wed, 20 Jul 2011 00:55:15 +0100 Al Viro <vi...@ZenIV.linux.org.uk> wrote:

> On Wed, Jul 20, 2011 at 12:52:16AM +0100, Al Viro wrote:
> > On Wed, Jul 20, 2011 at 12:45:51AM +0100, Al Viro wrote:
> >
> > > You know what... I doubt that you want to mess with ->d_seq checks here.
> > > It's definitely not Hugh's bug (unless he has bindings somewhere odd) and
> > > both ->mnt_mountpoint and ->mnt_root are pinned (and we are holding
> > > vfsmount_lock anyway). *inode assignment too early is a real bug, indeed,
> > > and we want to assign nd->seq if we cross mountpoint as both versions do,
> > > but check just before that is, in the best case, BUG_ON() fodder. We'd
> > > just found a vfsmount with ->mnt_mountpoint equal to path->dentry; it *can't*
> > > be stale, or we have a really nasty problem anyway.
> >
> > Kudos to neilb for spotting the pointless check, BTW; and no, his theory
> > that it might be needed since we could race with umount() is wrong - due to
> > vfsmount_lock being held.

Thanks.. but that patch seems to introduce an unused variable "seq".

NeilBrown

Al Viro

unread,

Jul 19, 2011, 9:50:02 PM7/19/11

to

On Wed, Jul 20, 2011 at 10:47:21AM +1000, NeilBrown wrote:
> > >
> > > Kudos to neilb for spotting the pointless check, BTW; and no, his theory
> > > that it might be needed since we could race with umount() is wrong - due to
> > > vfsmount_lock being held.
>
> Thanks.. but that patch seems to introduce an unused variable "seq".
>
> NeilBrown

Grr... I can amend and push again, of course, but I've no idea if Linus
has already pulled into his tree ;-/ Still not there in tree on hera,
but that's not saying much. Linus?

Linus Torvalds

unread,

Jul 20, 2011, 1:00:01 AM7/20/11

to

On Tue, Jul 19, 2011 at 6:40 PM, Al Viro <vi...@zeniv.linux.org.uk> wrote:
>
> Grr... �I can amend and push again, of course, but I've no idea if Linus
> has already pulled into his tree ;-/ �Still not there in tree on hera,
> but that's not saying much. �Linus?

I fetched it, amended it, and merged the amended commit.

Linus