Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

DoS with unprivileged mounts

33 views
Skip to first unread message

Miklos Szeredi

unread,
Aug 14, 2013, 1:50:02 PM8/14/13
to
There's a simple and effective way to prevent unlink(2) and rename(2)
from operating on any file or directory by simply mounting something
on it. In any mount instance in any namespace.

Was this considered in the unprivileged mount design?

The solution is also theoretically simple: mounts in unpriv namespaces
are marked "volatile" and are dissolved on an unlink type operation.

Such volatile mounts would be useful in general too.

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Andy Lutomirski

unread,
Aug 14, 2013, 3:30:02 PM8/14/13
to
On 08/14/2013 10:42 AM, Miklos Szeredi wrote:
> There's a simple and effective way to prevent unlink(2) and rename(2)
> from operating on any file or directory by simply mounting something
> on it. In any mount instance in any namespace.
>
> Was this considered in the unprivileged mount design?
>
> The solution is also theoretically simple: mounts in unpriv namespaces
> are marked "volatile" and are dissolved on an unlink type operation.

I'd actually prefer the reverse: unprivileged mounts don't prevent
unlink and rename. If the dentry goes away, then the mount could still
exist, sans underlying file. (This is already supported on network
filesystems.)

--Andy

Eric W. Biederman

unread,
Aug 14, 2013, 3:40:02 PM8/14/13
to
Miklos Szeredi <mik...@szeredi.hu> writes:

> There's a simple and effective way to prevent unlink(2) and rename(2)
> from operating on any file or directory by simply mounting something
> on it. In any mount instance in any namespace.
>
> Was this considered in the unprivileged mount design?

The focus was on not fooling privileged applications and some of the
secondary effects that really should have been thought about in more
depth were like this one were overlooked.

> The solution is also theoretically simple: mounts in unpriv namespaces
> are marked "volatile" and are dissolved on an unlink type operation.
>
> Such volatile mounts would be useful in general too.

Agreed.

This is a problem that is a general pain with mount namespaces in
general.

I think the real technical hurdle is finding the mounts t in some random
mount namespace. Once we can do that relatively efficiently the rest
becomes simple.

Eric

Eric W. Biederman

unread,
Aug 14, 2013, 4:00:02 PM8/14/13
to
Andy Lutomirski <lu...@amacapital.net> writes:

> On 08/14/2013 10:42 AM, Miklos Szeredi wrote:
>> There's a simple and effective way to prevent unlink(2) and rename(2)
>> from operating on any file or directory by simply mounting something
>> on it. In any mount instance in any namespace.
>>
>> Was this considered in the unprivileged mount design?
>>
>> The solution is also theoretically simple: mounts in unpriv namespaces
>> are marked "volatile" and are dissolved on an unlink type operation.
>
> I'd actually prefer the reverse: unprivileged mounts don't prevent
> unlink and rename. If the dentry goes away, then the mount could still
> exist, sans underlying file. (This is already supported on network
> filesystems.)

Of course we do this in network filesystems by pretending the
rename/unlink did not actually happen. The vfs insists that it be lied
to instead of mirroring what actually happened.

Again all of this is a question about efficient data structures and not
really one of semantics. Can either semantic be implemented in such a
way that it does not slow down the vfs?

Eric

Andy Lutomirski

unread,
Aug 14, 2013, 4:30:02 PM8/14/13
to
On Wed, Aug 14, 2013 at 12:53 PM, Eric W. Biederman
<ebie...@xmission.com> wrote:
> Andy Lutomirski <lu...@amacapital.net> writes:
>
>> On 08/14/2013 10:42 AM, Miklos Szeredi wrote:
>>> There's a simple and effective way to prevent unlink(2) and rename(2)
>>> from operating on any file or directory by simply mounting something
>>> on it. In any mount instance in any namespace.
>>>
>>> Was this considered in the unprivileged mount design?
>>>
>>> The solution is also theoretically simple: mounts in unpriv namespaces
>>> are marked "volatile" and are dissolved on an unlink type operation.
>>
>> I'd actually prefer the reverse: unprivileged mounts don't prevent
>> unlink and rename. If the dentry goes away, then the mount could still
>> exist, sans underlying file. (This is already supported on network
>> filesystems.)
>
> Of course we do this in network filesystems by pretending the
> rename/unlink did not actually happen. The vfs insists that it be lied
> to instead of mirroring what actually happened.
>
> Again all of this is a question about efficient data structures and not
> really one of semantics. Can either semantic be implemented in such a
> way that it does not slow down the vfs?

Given that vfs_unlink has:

if (d_mountpoint(dentry))
error = -EBUSY;

I think it's just a matter of changing / deleting that code.

--Andy

Eric W. Biederman

unread,
Aug 14, 2013, 6:00:02 PM8/14/13
to
Deleting the code is completely unacceptable as it generates mounts that
can never be unmounted.

Changing this code is what we were discussing. My point is that an
efficient replacement is not immediately obvious, and a solution that
degrades the performance of the fast path of the vfs to make this case
work better is not likely to be acceptable.

Eric

Miklos Szeredi

unread,
Aug 15, 2013, 1:10:01 AM8/15/13
to
On Wed, Aug 14, 2013 at 9:32 PM, Eric W. Biederman
<ebie...@xmission.com> wrote:

>> The solution is also theoretically simple: mounts in unpriv namespaces
>> are marked "volatile" and are dissolved on an unlink type operation.
>>
>> Such volatile mounts would be useful in general too.
>
> Agreed.
>
> This is a problem that is a general pain with mount namespaces in
> general.
>
> I think the real technical hurdle is finding the mounts t in some random
> mount namespace. Once we can do that relatively efficiently the rest
> becomes simple.

We already have a "struct mountpoint" hashed on the dentry. Chaining
mounts on that mountpoint would be trivial. And we need a
MNT_VOLATILE flag and that's it. If we fear that traversing the list
of mounts on the dentry to check for non-volatile ones then we could
also add a separate volatile counter to struct mountpoint and a
matching flag to the dentry. But I don't think that's really
necessary.

Thanks,
Miklos

Eric W. Biederman

unread,
Aug 15, 2013, 2:50:01 AM8/15/13
to
Miklos Szeredi <mik...@szeredi.hu> writes:

> On Wed, Aug 14, 2013 at 9:32 PM, Eric W. Biederman
> <ebie...@xmission.com> wrote:
>
>>> The solution is also theoretically simple: mounts in unpriv namespaces
>>> are marked "volatile" and are dissolved on an unlink type operation.
>>>
>>> Such volatile mounts would be useful in general too.
>>
>> Agreed.
>>
>> This is a problem that is a general pain with mount namespaces in
>> general.
>>
>> I think the real technical hurdle is finding the mounts t in some random
>> mount namespace. Once we can do that relatively efficiently the rest
>> becomes simple.
>
> We already have a "struct mountpoint" hashed on the dentry. Chaining
> mounts on that mountpoint would be trivial. And we need a
> MNT_VOLATILE flag and that's it. If we fear that traversing the list
> of mounts on the dentry to check for non-volatile ones then we could
> also add a separate volatile counter to struct mountpoint and a
> matching flag to the dentry. But I don't think that's really
> necessary.

*Blink* I had overlooked "struct mountpoint". That indeed makes things
easier.

I agree we can chain "struct mount" on "struct mountpoint" and then we
would have an efficient implementation, that does not impact the vfs
fast path.

After that it becomes a question of permissions and semantics.

I am in the process of adopting the rule that something that is not
visible at the time we copy a set of mounts should not become visible
in the child mount namespace. Grr. This has been a busy month and
despite having been reviewed I haven't gotten around to pushing that
patch to linux-next.

But MNT_VOLATILE by definition can not reveal anything becasue the
underlying mount point is removed, so that all of that weirdness is in
propogating mounts between mount namespaces is not relevant here.

This is however the propogation of an unmount between mount namespaces.

In general we don't even need the MNT_VOLATILE flag we just need the
appropriate permission checks. However we do need something like
MNT_VOLATILE to prevent surprises. MNT_VOLATILE would be used to
prevent things like:

# mount --bind / /mnt
# rmdir /mnt/usr

I think the root user would be rather annoyed if that worked, so it does
appear we need something like MNT_VOLATILE.

Part of me does prefer the semantics Andy has suggested where instead of
unmounting things we have something like a skeleton of the mount tree
unioned with dcaches of the filesystems themselves. With "struct
mountpoint" we are amazing close to that already.

A mount skeleton would allow us to always remove and rename directories
and files without really caring, about what mounts were present.
Probably with just a quick lookup to see if we need to set
DCACHE_MOUNTED.

The big practical problem I can see with MNT_VOLATILE is mount points in
shared directories like /tmp but without the sticky set. At which point
it would be possible to delete another users mount points. Perhaps we
need restrictions on where a user can mount.

Are there any practical limitations we can add to mount that will
ensure MNT_VOLATILE won't impact other users? inode_capable looks like
a good canidate. Which effectively boils down to /proc/<pid>/uid_map.

I will have to see if that semantic is workable, but at first glance it
looks promising.

So in summary I am thinking.

- Restrict on which files/directories an unprivileged user can mount.
That immediately prevents most weirdness.

- Allow unlink and rmdir and rename to cause recursive lazy unmounts if
the caller has the appropriate permissions.

Perhaps the permissions needed include MNT_VOLATILE.

Any other suggestions?

Eric

Andy Lutomirski

unread,
Aug 15, 2013, 3:00:01 AM8/15/13
to
On Wed, Aug 14, 2013 at 11:45 PM, Eric W. Biederman
Two possible nasty cases:

1. mount whatever /tmp/foo/bar; rmdir /tmp/foo/bar; rmdir /tmp/foo

Presumably ls /tmp shouldn't show foo. Should cd /tmp/foo/bar work?
What about umount /tmp/foo/bar? What about cd /tmp/foo?

2. mount whatever /tmp/foo; rmdir /tmp/foo; mkdir /tmp/foo

Ugh.

--Andy

Eric W. Biederman

unread,
Aug 15, 2013, 4:00:04 AM8/15/13
to
You have to have two mount namespaces or at least two different paths to
to the same filesystem to make this work. rdir /tmp/foo/bar where
/tmp/foo/bar is a mountpoint in your mount namespace will not work
because you are trying to remove a root directory.

So the semantics I would expect to see if it was implementable is
/tmp/foo and /tmp/foo/bar would continue to exist on the paths where
/tmp/foo/bar was a mount point and would disappear as soon as it was
unmounted.

> 2. mount whatever /tmp/foo; rmdir /tmp/foo; mkdir /tmp/foo
>
> Ugh.

Likewise. I would expect to see the new /tmp/foo slide under the old
/tmp/foo mountpoint.

Essentially my expectation would be that the mount points would float
over the filesystems. Semantically I like it, and have played with the
idea before. Implementation wise shrug I didn't realize any of this was
close to being practically implementatable until today.

Eric

Miklos Szeredi

unread,
Aug 15, 2013, 5:30:02 AM8/15/13
to
On Thu, Aug 15, 2013 at 8:45 AM, Eric W. Biederman
<ebie...@xmission.com> wrote:
> Part of me does prefer the semantics Andy has suggested where instead of
> unmounting things we have something like a skeleton of the mount tree
> unioned with dcaches of the filesystems themselves. With "struct
> mountpoint" we are amazing close to that already.
>
> A mount skeleton would allow us to always remove and rename directories
> and files without really caring, about what mounts were present.
> Probably with just a quick lookup to see if we need to set
> DCACHE_MOUNTED.

Yes, we could have a separate dentry tree just for anchoring mounts
and we could make a union with the real dentry tree. But implementing
that in a low overhead manner is not trivial. Anchoring mounts on
real dentries is *rather* convenient. And yes, we only actually need
the skeleton dentries when the real ones disappear, but that doesn't
make the implementation any simpler.

> The big practical problem I can see with MNT_VOLATILE is mount points in
> shared directories like /tmp but without the sticky set. At which point
> it would be possible to delete another users mount points. Perhaps we
> need restrictions on where a user can mount.

I think if user X mounts something on /tmp/foo owned by user Y and
user Y removes /tmp/foo then that shouldn't really surprise user X. I
don't see this an issue.

Thanks,
Miklos

Rob Landley

unread,
Aug 15, 2013, 5:30:03 PM8/15/13
to
On 08/14/2013 12:42:19 PM, Miklos Szeredi wrote:
> There's a simple and effective way to prevent unlink(2) and rename(2)
> from operating on any file or directory by simply mounting something
> on it. In any mount instance in any namespace.
>
> Was this considered in the unprivileged mount design?
>
> The solution is also theoretically simple: mounts in unpriv namespaces
> are marked "volatile" and are dissolved on an unlink type operation.
>
> Such volatile mounts would be useful in general too.

Would that "anonymous inode" thing that wandered by recently help,
letting umount move the mount to one side so you could keep the mount
point as the root of your per-process hierarchy but not have it glued
to other people's namespaces?

Rob--

Eric W. Biederman

unread,
Oct 4, 2013, 6:50:01 PM10/4/13
to

Signed-off-by: Eric W. Biederman <ebied...@twitter.com>
---
fs/mount.h | 2 ++
fs/namespace.c | 5 +++++
2 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index 64a858143ff9..e4342b8dfab1 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -21,6 +21,7 @@ struct mnt_pcp {
struct mountpoint {
struct list_head m_hash;
struct dentry *m_dentry;
+ struct list_head m_list;
int m_count;
};

@@ -47,6 +48,7 @@ struct mount {
struct mount *mnt_master; /* slave is on master->mnt_slave_list */
struct mnt_namespace *mnt_ns; /* containing namespace */
struct mountpoint *mnt_mp; /* where is it mounted */
+ struct list_head mnt_mp_list; /* list mounts with the same mountpoint */
#ifdef CONFIG_FSNOTIFY
struct hlist_head mnt_fsnotify_marks;
__u32 mnt_fsnotify_mask;
diff --git a/fs/namespace.c b/fs/namespace.c
index da5c49483430..d092964fe7f9 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -197,6 +197,7 @@ static struct mount *alloc_vfsmnt(const char *name)
INIT_LIST_HEAD(&mnt->mnt_share);
INIT_LIST_HEAD(&mnt->mnt_slave_list);
INIT_LIST_HEAD(&mnt->mnt_slave);
+ INIT_LIST_HEAD(&mnt->mnt_mp_list);
#ifdef CONFIG_FSNOTIFY
INIT_HLIST_HEAD(&mnt->mnt_fsnotify_marks);
#endif
@@ -636,6 +637,7 @@ static struct mountpoint *new_mountpoint(struct dentry *dentry)
mp->m_dentry = dentry;
mp->m_count = 1;
list_add(&mp->m_hash, chain);
+ INIT_LIST_HEAD(&mp->m_list);
return mp;
}

@@ -691,6 +693,7 @@ static void detach_mnt(struct mount *mnt, struct path *old_path)
list_del_init(&mnt->mnt_hash);
put_mountpoint(mnt->mnt_mp);
mnt->mnt_mp = NULL;
+ list_del_init(&mnt->mnt_mp_list);
}

/*
@@ -705,6 +708,7 @@ void mnt_set_mountpoint(struct mount *mnt,
child_mnt->mnt_mountpoint = dget(mp->m_dentry);
child_mnt->mnt_parent = mnt;
child_mnt->mnt_mp = mp;
+ list_add_tail(&child_mnt->mnt_mp_list, &mp->m_list);
}

/*
@@ -1193,6 +1197,7 @@ void umount_tree(struct mount *mnt, int propagate)
p->mnt_parent->mnt_ghosts++;
put_mountpoint(p->mnt_mp);
p->mnt_mp = NULL;
+ list_del_init(&mnt->mnt_mp_list);
}
change_mnt_propagation(p, MS_PRIVATE);
}
--
1.7.5.4

Eric W. Biederman

unread,
Oct 4, 2013, 6:50:02 PM10/4/13
to

With the introduction of mount namespaces and bind mounts it because
possible to access files and directories that in other locations in
were used as mount points. Especially with mount namespaces has
become very confusing why rm -rf somedir return -EBUSY because some
directory is mounted somewhere else. With the addition of user
namespaces allowing unprivileged mounts this condition has gone from
annoying to allowing a DOS attack on more privileged users.

The simplest approach appears to be to remove the -EBUSY message,
allow unlink and rename, and lazily unmount the mount point.

In most cases this is less surprising as this is an implementation
of the normal unix behavior of allowing unlinking of files.

The change implemented in this patch allows the following to succeed:

The vfs does not currently follow paths up to the final component for
the rename and unlink system calls making the boldest version of this
idea the simplest to implement. Which should it simple to spot problems
with this idea.

While different from our historical behavior this change does not look
like it will break anything, or introduce any security
vulnerabilities. In a quick survey of all of the common mount points
on linux systems I found mount points in directories owned and
modifiable by root, and fuse fuse mounts in directories owned by the
``mounter'' of the fuse filesystem. In both of these cases relying on
the permissions of the directory does not practically change the user
who is allowed to unmount the filesystem.

Attempting to anticipate cases I have not witnessed I observe that
every directory in a trusted path to a file must limit modification
such that no one else may modify that directory. For files trusted by
suid root executables root most own and be the only user capable of
modifying the directory and all parent directories for the files to be
safe. Therefore for mount points part of a trusted path only root
should be able to unlink any directory or file on that path. Which
means after this change for a secured path only root can unmount
directories.

For mount points part of a path we can not trust we should not care if
the just disappear, as that is just another kind of arbitrary
manipulation.

So I conclude that the existing conditions will ensure that the permissions
on directories will be sufficiently limited that the new unmount on unlink
behavior will not cause problems.

Signed-off-by: "Eric W. Biederman" <ebie...@xmission.com>
---
fs/afs/dir.c | 3 +-
fs/dcache.c | 80 ++++++++++++++++++++----------------------------
fs/fuse/dir.c | 3 +-
fs/gfs2/dentry.c | 4 +--
fs/namei.c | 31 ++++++------------
fs/nfs/dir.c | 5 +--
fs/sysfs/dir.c | 9 +-----
include/linux/dcache.h | 3 +-
8 files changed, 51 insertions(+), 87 deletions(-)

diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index 646337dc5201..7fb69d45f1b9 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -686,8 +686,7 @@ not_found:

out_bad:
/* don't unhash if we have submounts */
- if (check_submounts_and_drop(dentry) != 0)
- goto out_skip;
+ shrink_submounts_and_drop(dentry);

_debug("dropping dentry %s/%s",
parent->d_name.name, dentry->d_name.name);
diff --git a/fs/dcache.c b/fs/dcache.c
index 41000305d716..1e9bf96b0132 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1373,7 +1373,7 @@ int d_set_mounted(struct dentry *dentry)
int ret = -ENOENT;
write_seqlock(&rename_lock);
for (p = dentry->d_parent; !IS_ROOT(p); p = p->d_parent) {
- /* Need exclusion wrt. check_submounts_and_drop() */
+ /* Need exclusion wrt. shrink_submounts_and_drop() */
spin_lock(&p->d_lock);
if (unlikely(d_unhashed(p))) {
spin_unlock(&p->d_lock);
@@ -1478,70 +1478,56 @@ void shrink_dcache_parent(struct dentry *parent)
}
EXPORT_SYMBOL(shrink_dcache_parent);

-static enum d_walk_ret check_and_collect(void *_data, struct dentry *dentry)
+struct detach_data {
+ struct dentry *found;
+};
+static enum d_walk_ret do_detach_submounts(void *ptr, struct dentry *dentry)
{
- struct select_data *data = _data;
-
- if (d_mountpoint(dentry)) {
- data->found = -EBUSY;
- return D_WALK_QUIT;
- }
-
- return select_collect(_data, dentry);
-}
+ struct detach_data *data = ptr;

-static void check_and_drop(void *_data)
-{
- struct select_data *data = _data;
+ if (d_mountpoint(dentry))
+ data->found = dentry;

- if (d_mountpoint(data->start))
- data->found = -EBUSY;
- if (!data->found)
- __d_drop(data->start);
+ return data->found ? D_WALK_QUIT : D_WALK_CONTINUE;
}

/**
- * check_submounts_and_drop - prune dcache, check for submounts and drop
+ * detach_submounts - check for submounts and detach them.
*
- * All done as a single atomic operation relative to has_unlinked_ancestor().
- * Returns 0 if successfully unhashed @parent. If there were submounts then
- * return -EBUSY.
+ * @dentry: dentry to find mount points under.
*
- * @dentry: dentry to prune and drop
+ * If dentry or any of it's children is a mount point detach those mounts.
*/
-int check_submounts_and_drop(struct dentry *dentry)
+void detach_submounts(struct dentry *dentry)
{
- int ret = 0;
-
- /* Negative dentries can be dropped without further checks */
- if (!dentry->d_inode) {
- d_drop(dentry);
- goto out;
- }
-
+ struct detach_data data;
for (;;) {
- struct select_data data;
-
- INIT_LIST_HEAD(&data.dispose);
- data.start = dentry;
- data.found = 0;
+ data.found = NULL;
+ d_walk(dentry, &data, do_detach_submounts, NULL);

- d_walk(dentry, &data, check_and_collect, check_and_drop);
- ret = data.found;
-
- if (!list_empty(&data.dispose))
- shrink_dentry_list(&data.dispose);
-
- if (ret <= 0)
+ if (!data.found)
break;

+ detach_mounts(data.found);
cond_resched();
}
+ detach_mounts(dentry);
+}

-out:
- return ret;
+/**
+ * shrink_submounts_and_drop - detach submounts, prune dcache, and drop
+ *
+ * All done as a single atomic operation reletaive to d_set_mounted().
+ *
+ * @dentry: dentry to detach, prune and drop
+ */
+void shrink_submounts_and_drop(struct dentry *dentry)
+{
+ d_drop(dentry);
+ detach_submounts(dentry);
+ shrink_dcache_parent(dentry);
}
-EXPORT_SYMBOL(check_submounts_and_drop);
+EXPORT_SYMBOL(shrink_submounts_and_drop);

/**
* __d_alloc - allocate a dcache entry
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 62b43b577bfc..b1cd7b79a325 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -259,8 +259,7 @@ out:

invalid:
ret = 0;
- if (check_submounts_and_drop(entry) != 0)
- ret = 1;
+ shrink_submounts_and_drop(entry);
goto out;
}

diff --git a/fs/gfs2/dentry.c b/fs/gfs2/dentry.c
index d3a5d4e29ba5..2ecc2b873829 100644
--- a/fs/gfs2/dentry.c
+++ b/fs/gfs2/dentry.c
@@ -93,9 +93,7 @@ invalid_gunlock:
if (!had_lock)
gfs2_glock_dq_uninit(&d_gh);
invalid:
- if (check_submounts_and_drop(dentry) != 0)
- goto valid;
-
+ shrink_submounts_and_drop(dentry);
dput(parent);
return 0;

diff --git a/fs/namei.c b/fs/namei.c
index 645268f23eb6..b18b017c946b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3560,10 +3560,6 @@ int vfs_rmdir(struct inode *dir, struct dentry *dentry)
dget(dentry);
mutex_lock(&dentry->d_inode->i_mutex);

- error = -EBUSY;
- if (d_mountpoint(dentry))
- goto out;
-
error = security_inode_rmdir(dir, dentry);
if (error)
goto out;
@@ -3575,6 +3571,7 @@ int vfs_rmdir(struct inode *dir, struct dentry *dentry)

dentry->d_inode->i_flags |= S_DEAD;
dont_mount(dentry);
+ detach_mounts(dentry);

out:
mutex_unlock(&dentry->d_inode->i_mutex);
@@ -3657,14 +3654,12 @@ int vfs_unlink(struct inode *dir, struct dentry *dentry)
return -EPERM;

mutex_lock(&dentry->d_inode->i_mutex);
- if (d_mountpoint(dentry))
- error = -EBUSY;
- else {
- error = security_inode_unlink(dir, dentry);
+ error = security_inode_unlink(dir, dentry);
+ if (!error) {
+ error = dir->i_op->unlink(dir, dentry);
if (!error) {
- error = dir->i_op->unlink(dir, dentry);
- if (!error)
- dont_mount(dentry);
+ dont_mount(dentry);
+ detach_mounts(dentry);
}
}
mutex_unlock(&dentry->d_inode->i_mutex);
@@ -3988,10 +3983,6 @@ static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
if (target)
mutex_lock(&target->i_mutex);

- error = -EBUSY;
- if (d_mountpoint(old_dentry) || d_mountpoint(new_dentry))
- goto out;
-
error = -EMLINK;
if (max_links && !target && new_dir != old_dir &&
new_dir->i_nlink >= max_links)
@@ -4006,6 +3997,7 @@ static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
if (target) {
target->i_flags |= S_DEAD;
dont_mount(new_dentry);
+ detach_mounts(new_dentry);
}
out:
if (target)
@@ -4031,16 +4023,15 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
if (target)
mutex_lock(&target->i_mutex);

- error = -EBUSY;
- if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
- goto out;
-
error = old_dir->i_op->rename(old_dir, old_dentry, new_dir, new_dentry);
if (error)
goto out;

- if (target)
+ if (target) {
dont_mount(new_dentry);
+ detach_mounts(new_dentry);
+ }
+ detach_mounts(old_dentry);
if (!(old_dir->i_sb->s_type->fs_flags & FS_RENAME_DOES_D_MOVE))
d_move(old_dentry, new_dentry);
out:
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 854a8f05a610..e8e35acd8850 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1142,10 +1142,7 @@ out_zap_parent:
if (dentry->d_flags & DCACHE_DISCONNECTED)
goto out_valid;
}
- /* If we have submounts, don't unhash ! */
- if (check_submounts_and_drop(dentry) != 0)
- goto out_valid;
-
+ shrink_submounts_and_drop(dentry);
dput(parent);
dfprintk(LOOKUPCACHE, "NFS: %s(%s/%s) is invalid\n",
__func__, dentry->d_parent->d_name.name,
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 4d83cedb9fcb..477c66d4e2a8 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -327,7 +327,6 @@ static int sysfs_dentry_revalidate(struct dentry *dentry, unsigned int flags)
}

mutex_unlock(&sysfs_mutex);
-out_valid:
return 1;
out_bad:
/* Remove the dentry from the dcache hashes.
@@ -341,13 +340,7 @@ out_bad:
* to the dcache hashes.
*/
mutex_unlock(&sysfs_mutex);
-
- /* If we have submounts we must allow the vfs caches
- * to lie about the state of the filesystem to prevent
- * leaks and other nasty things.
- */
- if (check_submounts_and_drop(dentry) != 0)
- goto out_valid;
+ shrink_submounts_and_drop(dentry);

return 0;
}
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 59066e0b4ff1..17948b49f3d5 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -254,7 +254,8 @@ extern void d_prune_aliases(struct inode *);

/* test whether we have any submounts in a subdir tree */
extern int have_submounts(struct dentry *);
-extern int check_submounts_and_drop(struct dentry *);
+extern void detach_submounts(struct dentry *dentry);
+extern void shrink_submounts_and_drop(struct dentry *);

/*
* This adds the entry to the hash queues.
--
1.7.5.4

Eric W. Biederman

unread,
Oct 4, 2013, 6:50:02 PM10/4/13
to

This patchset is an attempt to address two problems:
1) Not all modifications to the filesystems happen through the vfs and
since the vfs can not cope with a mount point being unlinked or
renamed filesystems whose modifications that do not come through the
vfs are required to lie.

2) Through an oversight it is now possible for one unprivileged user to
mount something on another unprivileged users dentry and make it
impossible for the other user to unlink or rename that dentry.

It is now technically possible to easily lift the restriction on
unlinking and renaming files with mount points on them, with a
corresponding reduction in complexity of the vfs semantics and a small
code side reduction.

After thinking about it removing the restrictions on mount points
appears safe, because it is just plain dumb to have a mount point
in a directory that is not restricted to root only modifications.

This is a change in user visible semantics, so I want to be very careful
about this. Are there any reasons to not make this change?

All of this happens under a full pile of vfs locks so this shouldn't
affect the vfs scalabilitiy work that is on-going.

Eric W. Biederman (3):
vfs: Keep a list of mounts on a mount point
vfs: Add a function to lazily unmount all mounts from any dentry.
vfs: Lazily remove mounts on unlinked files and directories.

fs/afs/dir.c | 3 +-
fs/dcache.c | 80 ++++++++++++++++++++----------------------------
fs/fuse/dir.c | 3 +-
fs/gfs2/dentry.c | 4 +--
fs/mount.h | 3 ++
fs/namei.c | 31 ++++++------------
fs/namespace.c | 29 +++++++++++++++++
fs/nfs/dir.c | 5 +--
fs/sysfs/dir.c | 9 +-----
include/linux/dcache.h | 3 +-
10 files changed, 83 insertions(+), 87 deletions(-)

Eric W. Biederman

unread,
Oct 4, 2013, 6:50:02 PM10/4/13
to

Signed-off-by: Eric W. Biederman <ebied...@twitter.com>
---
fs/mount.h | 1 +
fs/namespace.c | 24 ++++++++++++++++++++++++
2 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index e4342b8dfab1..7a6a2bb3f290 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -79,6 +79,7 @@ static inline int is_mounted(struct vfsmount *mnt)
}

extern struct mount *__lookup_mnt(struct vfsmount *, struct dentry *, int);
+extern void detach_mounts(struct dentry *dentry);

static inline void get_mnt_ns(struct mnt_namespace *ns)
{
diff --git a/fs/namespace.c b/fs/namespace.c
index d092964fe7f9..8eaee0c14fdb 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1294,6 +1294,30 @@ static int do_umount(struct mount *mnt, int flags)
return retval;
}

+void detach_mounts(struct dentry *dentry)
+{
+ struct mount *mnt, *next;
+ struct mountpoint *mp;
+
+ namespace_lock();
+ if (!d_mountpoint(dentry)) {
+ namespace_unlock();
+ return;
+ }
+ mp = new_mountpoint(dentry);
+ if (IS_ERR(mp)) {
+ return;
+ }
+ br_write_lock(&vfsmount_lock);
+ list_for_each_entry_safe(mnt, next, &mp->m_list, mnt_mp_list) {
+ if (!list_empty(&mnt->mnt_list))
+ umount_tree(mnt, 1);
+ }
+ br_write_unlock(&vfsmount_lock);
+ put_mountpoint(mp);
+ namespace_unlock();
+}
+
/*
* Is the caller allowed to modify his namespace?
*/
--
1.7.5.4

Linus Torvalds

unread,
Oct 4, 2013, 7:30:02 PM10/4/13
to
On Fri, Oct 4, 2013 at 3:41 PM, Eric W. Biederman <ebie...@xmission.com> wrote:
>
> After thinking about it removing the restrictions on mount points
> appears safe, because it is just plain dumb to have a mount point
> in a directory that is not restricted to root only modifications.
>
> This is a change in user visible semantics, so I want to be very careful
> about this. Are there any reasons to not make this change?

At least one worry: people are very used to 'rmdir()' not removing
empty directories, and I've written code myself that just does an
'rmdir()' without worrying about it. I think git has code like "remove
file, then try to remove directory file is in, and recurse until it
fails or you hit the top of tree". And it all depends on knowing that
rmdir() is harmless, and returns the appropriate error when the
directory isn't empty.

And you're now changing that. If it's a mount-point, the rmdir just
succeeds, afaik.

Does anybody care? I dunno. But it looks like a _big_ semantic change.

That said, I like the _concept_ of being able to remove a mount-point
and the mount just goes away. But I do think that for sanity sake, it
should have something like "if one of the mounts is in the current
namespace, return -EBUSY". IOW, the patch-series would make the VFS
layer _able_ to remove mount-points, but a normal rmdir() when
something is mounted in that namespace would fail in order to give
legacy behavior.

Hmm?

Linus

Eric W. Biederman

unread,
Oct 4, 2013, 8:10:01 PM10/4/13
to
Linus Torvalds <torv...@linux-foundation.org> writes:

> On Fri, Oct 4, 2013 at 3:41 PM, Eric W. Biederman <ebie...@xmission.com> wrote:
>>
>> After thinking about it removing the restrictions on mount points
>> appears safe, because it is just plain dumb to have a mount point
>> in a directory that is not restricted to root only modifications.
>>
>> This is a change in user visible semantics, so I want to be very careful
>> about this. Are there any reasons to not make this change?
>
> At least one worry: people are very used to 'rmdir()' not removing
> empty directories, and I've written code myself that just does an
> 'rmdir()' without worrying about it. I think git has code like "remove
> file, then try to remove directory file is in, and recurse until it
> fails or you hit the top of tree". And it all depends on knowing that
> rmdir() is harmless, and returns the appropriate error when the
> directory isn't empty.
>
> And you're now changing that. If it's a mount-point, the rmdir just
> succeeds, afaik.
>
> Does anybody care? I dunno. But it looks like a _big_ semantic change.

Which is definitley why I am asking and being careful.

> That said, I like the _concept_ of being able to remove a mount-point
> and the mount just goes away. But I do think that for sanity sake, it
> should have something like "if one of the mounts is in the current
> namespace, return -EBUSY". IOW, the patch-series would make the VFS
> layer _able_ to remove mount-points, but a normal rmdir() when
> something is mounted in that namespace would fail in order to give
> legacy behavior.
>
> Hmm?

In principle I have no problems tweaking rmdir to check for that case.

At the same time the real reason that this is safe is that mount points
are almost always part of trusted paths to important files and you just
don't mess with those paths.

So tweaking rmdir to fail would be more about making stupid mistakes
like running "rm -rf /" fail than it would be about security or
correctness.

My intuition agrees with yours that it is less surprising if rmdir was
forbidden in mount namespaces where it has a mount. Root would still be
able to do crazy things like:

# unshare --mount /bin/bash
# umount -l /sys
# mv /proc /sys
# exit
#
# ls /proc
ls: cannot access /proc No such file or directory
#
# ls /sys
1 2 3 4 acpi/ asound/ buddyinfo bus/ cmdline config.gz consoles cpuinfo
crypto devices diskstats dma dir/ driver/ execdomains fb filesystems
fs/ interrupts iomem ioports irq/ kallsyms kcore key-users kmsg
kpagecount kpageflags loadavg locks mdstat meminfo misc modules mounts/
mtrr pagetypeinfo partitions schedstat scsi/ self/ slabinfo softirqs
stat swaps sys/ sysrq-trigger sysvipc/ timer_list tty/ uptime version
vmallocinfo vmstat zoneinfo
#


I just noticed that Al's latest vfs changes posted yesterday mean I need
to rebase and possibly respin these patches, as all of the locking and
interesting bits of the dcache have changed. I don't think the
conflicts would be fun to resolve during the merge window.

Eric

Eric W. Biederman

unread,
Oct 4, 2013, 10:00:02 PM10/4/13
to
ebie...@xmission.com (Eric W. Biederman) writes:

> I just noticed that Al's latest vfs changes posted yesterday mean I need
> to rebase and possibly respin these patches, as all of the locking and
> interesting bits of the dcache have changed. I don't think the
> conflicts would be fun to resolve during the merge window.

On second look the only conflict is the rename of the vfsmount_lock to
lock_mount_hash, and a few bits of context. So nothing interesting.

Eric W. Biederman

unread,
Oct 4, 2013, 10:40:01 PM10/4/13
to

Programs have been known to test for empty directories by attempting
to remove them. To keep from violating the principle of least
surprise don't let directories the caller can see with someting
mounted on them be deleted.

With a little luck this may prevent commands stupid commands
like rm -rf from eating your system.

Signed-off-by: "Eric W. Biederman" <ebie...@xmission.com>
---
fs/namei.c | 21 +++++++++++++++++++++
1 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index b18b017c946b..b9cae480ac27 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3547,6 +3547,20 @@ void dentry_unhash(struct dentry *dentry)
spin_unlock(&dentry->d_lock);
}

+static bool covered(struct vfsmount *mnt, struct dentry *dentry)
+{
+ /* test to see if a dentry is covered with a mount in
+ * the current mount namespace.
+ */
+ bool is_covered;
+
+ rcu_read_lock();
+ is_covered = d_mountpoint(dentry) && __lookup_mnt(mnt, dentry, 1);
+ rcu_read_unlock();
+
+ return is_covered;
+}
+
int vfs_rmdir(struct inode *dir, struct dentry *dentry)
{
int error = may_delete(dir, dentry, 1);
@@ -3619,6 +3633,9 @@ retry:
error = -ENOENT;
goto exit3;
}
+ error = -EBUSY;
+ if (covered(nd.path.mnt, dentry))
+ goto exit3;
error = security_path_rmdir(&nd.path, dentry);
if (error)
goto exit3;
@@ -4155,6 +4172,10 @@ retry:
error = -ENOTEMPTY;
if (new_dentry == trap)
goto exit5;
+ error = -EBUSY;
+ if (new_dentry->d_inode && S_ISDIR(new_dentry->d_inode->i_mode) &&
+ covered(newnd.path.mnt, new_dentry))
+ goto exit5;

error = security_path_rename(&oldnd.path, old_dentry,
&newnd.path, new_dentry);
--
1.7.5.4

Serge E. Hallyn

unread,
Oct 5, 2013, 12:00:03 PM10/5/13
to
Quoting Eric W. Biederman (ebie...@xmission.com):
>
> Programs have been known to test for empty directories by attempting
> to remove them. To keep from violating the principle of least
> surprise don't let directories the caller can see with someting
> mounted on them be deleted.
>
> With a little luck this may prevent commands stupid commands
> like rm -rf from eating your system.
>
> Signed-off-by: "Eric W. Biederman" <ebie...@xmission.com>

Thanks. I've not had a chance to look at the implementation details
(hope to do so this weekend) but ack to the change itself being
needed. AFAIK this was the last uncomfortable piece about unprivileged
CLONE_NEWUSER.

Rob Landley

unread,
Oct 5, 2013, 7:10:02 PM10/5/13
to
On 10/04/2013 05:41:25 PM, Eric W. Biederman wrote:
>
> This patchset is an attempt to address two problems:
> 1) Not all modifications to the filesystems happen through the vfs and
> since the vfs can not cope with a mount point being unlinked or
> renamed filesystems whose modifications that do not come through
> the
> vfs are required to lie.
>
> 2) Through an oversight it is now possible for one unprivileged user
> to
> mount something on another unprivileged users dentry and make it
> impossible for the other user to unlink or rename that dentry.
>
> It is now technically possible to easily lift the restriction on
> unlinking and renaming files with mount points on them, with a
> corresponding reduction in complexity of the vfs semantics and a small
> code side reduction.

A todo item I've had _forever_ is fixing chroot() to not be broken so
that you can trivially break out of a chroot via:

chdir("/");
mkdir("sub");
chroot("sub");
chdir("./../../../../../../../..");

(Because chroot() affects where "/" points but NOT where "." points to,
and chdir does an == check with the dentry "/" points at to know when
to stop, so if you move "/" under "." you can back up to the actual
root of the tree.)

The above is why lxc uses pivot_root() instead of chroot().

These days, we have multiple mount trees so there's no reason chroot()
can't trim the process local mount tree (creating a new bind mount if
necessary). Except my todo list runneth over and I haven't had a chance
to dig in and see what would be involved. (Last time I brought this up
people were wondering why chroot() didn't just move "." to the new "/"
if it wasn't under it. I had no idea, still don't.)

Rob--

Linus Torvalds

unread,
Oct 5, 2013, 7:20:02 PM10/5/13
to
On Sat, Oct 5, 2013 at 4:07 PM, Rob Landley <r...@landley.net> wrote:
>
> A todo item I've had _forever_ is fixing chroot() to not be broken so that
> you can trivially break out of a chroot via:

What drugs are you on?

Your example is moronic, and against all _documented_ uses of chroot.

> chdir("/");
> mkdir("sub");
> chroot("sub");
> chdir("./../../../../../../../..");

After you do a chroot(), you need to chdir *into* the root. The reason
chroot() itself doesn't do that is simple: you may still be doing
various setup stuff.

But your example is just stupid. Yes, chroot'ed environments can
generally be escaped, but your example escape is simply because you
didn't use chroot() correctly.

So learn this pattern: every time you use chroot, add a simple

chdir("/");

immediately after the chroot call.

Then, if you decide that you want to do some setup in between the two
(like the interface allows), that's fine, but always start off with
that "chroot+chdir" pattern.

(Similarly, if it turns out that you want to chdir somewhere else,
like "/home/user" after the chroot, then you can obviously remove the
now superfluous chdir("/"), but you always conceptually start off with
that chroot/chdir pair)

Linus

Al Viro

unread,
Oct 5, 2013, 7:20:02 PM10/5/13
to
On Sat, Oct 05, 2013 at 06:07:42PM -0500, Rob Landley wrote:
> A todo item I've had _forever_ is fixing chroot() to not be broken
> so that you can trivially break out of a chroot via:
>
> chdir("/");
> mkdir("sub");
> chroot("sub");
> chdir("./../../../../../../../..");
>
> (Because chroot() affects where "/" points but NOT where "." points
> to, and chdir does an == check with the dentry "/" points at to know
> when to stop, so if you move "/" under "." you can back up to the
> actual root of the tree.)
>
> The above is why lxc uses pivot_root() instead of chroot().
>
> These days, we have multiple mount trees so there's no reason
> chroot() can't trim the process local mount tree (creating a new
> bind mount if necessary). Except my todo list runneth over and I
> haven't had a chance to dig in and see what would be involved. (Last
> time I brought this up people were wondering why chroot() didn't
> just move "." to the new "/" if it wasn't under it. I had no idea,
> still don't.)

1) RTFUNIXFAQ. chroot() never has been root-proof.

2) your "fix" isn't - it will lead to mounts done by chrooted process
not affecting other processes in the same namespace.

Al Viro

unread,
Oct 5, 2013, 7:30:01 PM10/5/13
to
On Sat, Oct 05, 2013 at 04:17:55PM -0700, Linus Torvalds wrote:
> On Sat, Oct 5, 2013 at 4:07 PM, Rob Landley <r...@landley.net> wrote:
> >
> > A todo item I've had _forever_ is fixing chroot() to not be broken so that
> > you can trivially break out of a chroot via:
>
> What drugs are you on?
>
> Your example is moronic, and against all _documented_ uses of chroot.

... and even fixing that example (chroot *can* be escaped, if you can
call chroot(2) - mkdir /foo, chdir /, chroot /foo, chroot .. and you are out)
the whole thing is idiocy - chroot() is not and has never been root-proof
and anybody expecting it to be has failed to read any number of FAQs out
there.

Linus Torvalds

unread,
Oct 5, 2013, 7:30:01 PM10/5/13
to
On Sat, Oct 5, 2013 at 4:17 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
>
> So learn this pattern: every time you use chroot, add a simple
>
> chdir("/");
>
> immediately after the chroot call.

.. btw, also make sure that you close all non-essential file
descriptors. Having any open directory file descriptors pointing to
outside the chroot is also a classic escape.

Even then, escaping chroot is usually fairly easy. Making a
escape-proof chroot is really quite hard. Basically impossible if you
allow root.

Rob Landley

unread,
Oct 5, 2013, 8:20:02 PM10/5/13
to
On 10/05/2013 06:17:55 PM, Linus Torvalds wrote:
> On Sat, Oct 5, 2013 at 4:07 PM, Rob Landley <r...@landley.net> wrote:
> >
> > A todo item I've had _forever_ is fixing chroot() to not be broken
> so that
> > you can trivially break out of a chroot via:
>
> What drugs are you on?

Enough caffeine to count as a plural.

> Your example is moronic, and against all _documented_ uses of chroot.

Yes. That's why it's an exploit.

My example allows someone who can chroot again to break out of a
previous chroot, which is the reason that lxc doesn't use chroot.

> > chdir("/");
> > mkdir("sub");
> > chroot("sub");
> > chdir("./../../../../../../../..");
>
> After you do a chroot(), you need to chdir *into* the root. The reason
> chroot() itself doesn't do that is simple: you may still be doing
> various setup stuff.

Yes, I asked you about this years ago before I implemented chroot and
switch_root for busybox and toybox, I know how to use it _right_. What
I'm saying is it doesn't prevent something running in the chroot from
doing it _wrong_, which is why lxc can't use chroot to implement
containers and instead uses pivot_root, which is a HORRIBLE system call
I'd like to move away from.

I want to implement a lightweight container command for toybox, but
when I dug into lxc and tried to swap their pivot_root for a simple
chroot, I got corrected.

> But your example is just stupid. Yes, chroot'ed environments can
> generally be escaped, but your example escape is simply because you
> didn't use chroot() correctly.

The above is code you run _in_ the chroot. After the first user used
chroot incorrectly.

The problem is that there's only one "/" symlink. If you move it down
under your current "." then nothing stops you from following .. all the
way up the tree. The fact you can do this after the fact, from within
an existing chroot, is the problem.

> So learn this pattern: every time you use chroot, add a simple
>
> chdir("/");
>
> immediately after the chroot call.

I do. But I can't stop other people from NOT doing that afterwards.

Rob--

Rob Landley

unread,
Oct 5, 2013, 8:20:02 PM10/5/13
to
On 10/05/2013 06:22:15 PM, Linus Torvalds wrote:
> On Sat, Oct 5, 2013 at 4:17 PM, Linus Torvalds
> <torv...@linux-foundation.org> wrote:
> >
> > So learn this pattern: every time you use chroot, add a simple
> >
> > chdir("/");
> >
> > immediately after the chroot call.
>
> .. btw, also make sure that you close all non-essential file
> descriptors. Having any open directory file descriptors pointing to
> outside the chroot is also a classic escape.
>
> Even then, escaping chroot is usually fairly easy. Making a
> escape-proof chroot is really quite hard. Basically impossible if you
> allow root.

Which is why containers have all sorts of extra plumbing. But that
extra plumbing is currently built on pivot_root(), not on chroot(). And
I'd dismissed pivot_root() as residue from initramfs with all that
kernel thread reparenting, so it seemed like the wrong tool for the
job, but obviously I'll take your and Al's word it's not...

Rob--

Rob Landley

unread,
Oct 5, 2013, 8:20:02 PM10/5/13
to
On 10/05/2013 06:19:15 PM, Al Viro wrote:
> On Sat, Oct 05, 2013 at 06:07:42PM -0500, Rob Landley wrote:
> > A todo item I've had _forever_ is fixing chroot() to not be broken
> > so that you can trivially break out of a chroot via:
> >
> > chdir("/");
> > mkdir("sub");
> > chroot("sub");
> > chdir("./../../../../../../../..");
> >
> > (Because chroot() affects where "/" points but NOT where "." points
> > to, and chdir does an == check with the dentry "/" points at to know
> > when to stop, so if you move "/" under "." you can back up to the
> > actual root of the tree.)
> >
> > The above is why lxc uses pivot_root() instead of chroot().
> >
> > These days, we have multiple mount trees so there's no reason
> > chroot() can't trim the process local mount tree (creating a new
> > bind mount if necessary). Except my todo list runneth over and I
> > haven't had a chance to dig in and see what would be involved. (Last
> > time I brought this up people were wondering why chroot() didn't
> > just move "." to the new "/" if it wasn't under it. I had no idea,
> > still don't.)
>
> 1) RTFUNIXFAQ. chroot() never has been root-proof.
>
> 2) your "fix" isn't - it will lead to mounts done by chrooted process
> not affecting other processes in the same namespace.

So if I write a lightweight container setup command, I need to use
pivot_root just like lxc does?

Rob--

Rob Landley

unread,
Oct 5, 2013, 8:30:02 PM10/5/13
to
On 10/05/2013 06:24:51 PM, Al Viro wrote:
> On Sat, Oct 05, 2013 at 04:17:55PM -0700, Linus Torvalds wrote:
> > On Sat, Oct 5, 2013 at 4:07 PM, Rob Landley <r...@landley.net> wrote:
> > >
> > > A todo item I've had _forever_ is fixing chroot() to not be
> broken so that
> > > you can trivially break out of a chroot via:
> >
> > What drugs are you on?
> >
> > Your example is moronic, and against all _documented_ uses of
> chroot.
>
> ... and even fixing that example (chroot *can* be escaped, if you can
> call chroot(2) - mkdir /foo, chdir /, chroot /foo, chroot .. and you
> are out)

Um, wasn't that my example?

> the whole thing is idiocy - chroot() is not and has never been
> root-proof
> and anybody expecting it to be has failed to read any number of FAQs
> out
> there.

Which is why I was proposing modifying it so lxc and containers didn't
have to use pivot_root() instead, with all the logic to rehome existing
processes and the weird "this path is relative the old root, this path
is relative to the new root" syntax from initial ramdisks.

Rob--

Linus Torvalds

unread,
Oct 5, 2013, 8:40:02 PM10/5/13
to
On Sat, Oct 5, 2013 at 5:18 PM, Rob Landley <r...@landley.net> wrote:
>
> Which is why containers have all sorts of extra plumbing. But that extra
> plumbing is currently built on pivot_root(), not on chroot(). And I'd
> dismissed pivot_root() as residue from initramfs with all that kernel thread
> reparenting, so it seemed like the wrong tool for the job, but obviously
> I'll take your and Al's word it's not...

Yeah, chroot() really doesn't cut it if you allow root access - and
thus internal chroot() calls - as you noticed (I didn't realize that
your example was meant to be run _inside_ a chroot).

There are generally other ways to escape chroot too if you're root.
It's just too hard to plug. That doesn't make chroot() useless - it
just means that the uses are elsewhere (it's useful for various
non-security issues like development environments, but it can also be
useful as one small _part_ of some bigger model, like a VM etc).

pivot_root() does end up being a "better chroot than chroot" if you're
looking for containment. It may not be a pretty system call, but it
does avoid at least the most obvious gotchas with chroot()..

Linus

Eric W. Biederman

unread,
Oct 5, 2013, 8:50:01 PM10/5/13
to
Rob Landley <r...@landley.net> writes:

> On 10/04/2013 07:03:23 PM, Eric W. Biederman wrote:
>>
>> In principle I have no problems tweaking rmdir to check for that case.
>>
>> At the same time the real reason that this is safe is that mount
>> points
>> are almost always part of trusted paths to important files and you
>> just
>> don't mess with those paths.
>>
>> So tweaking rmdir to fail would be more about making stupid mistakes
>> like running "rm -rf /" fail than it would be about security or
>> correctness.
>
> If you do an rm -rf descending through a mount point, it's going to
> delete the contents of the mount point before _trying_ to unlink the
> mount point, which may be bad for the thing you mounted. Then there's
> the fun corner case of "the directory wasn't empty before it was
> mounted on, and umounting revealed the overmounted files, so the unlink
> fails for that reason even after magic umount".

Yes. Adding the restrictions I added in 4/3 are really just about
preserving peoples experience that rmdir on a directory ls shows files
in fails. Just to be a little less surprising.

> Doing rmdir on a non-empty directory won't delete it if it _isn't_ a
> mount point, and presumably we require write access to directories to
> mount on them, so in what ways is this different than another user
> mucking about with my files asynchronously?

We don't require write access to files or directories before we mount on
them.

The permissions we require to mount is that the directory or file is
visible aka execute permissions all of the way to /, and that the
mounter has CAP_SYS_ADMIN in the user namespace that created the mount
namespace. Basically that the mounter is locally root.

In practice this means that if someone mounts something on your
directory it happens in another mount namespace and you don't care.
You don't even know about it unless you call rmdir and today that
returns -EBUSY.

The permission checks could be extended to ensure that the uid of the
file or directory is mapped in to the mounter's user namespace. However
that would arguably require extending the vfs -EBUSY handling to chown
or chmod. And even without that we are still left with the mess that
is the current magic insertion of -EBUSY because we have to deal with
mount point somehow.

So what I am proposing is removing the -EBUSY handling making it up to
whoever creates a mount to put the mount on path they trust if they
don't want the path to go away.

So for the owner of the directory who doesn't see the mount it is
nothing like a user scribbling into the directory.

With these patches it becomes almost exactly like scribling in someone
elses directory for the creator of the mount.


Further even without user namespaces and with root creating all of the
mounts we remove a current challenge in the VFS that requires
filesystems to lie about their directory structure.


Unless there is someone crazy enough to put mount points in directories
that are world writable I think my current patchset is a very nice
cleanup.

I have meandered all over the field. Did I answer your question?
I wasn't exactly certain what you were asking.

FYI: As for chroot you can not create a user namespace if you are
chrooted, so I suggest you file chroot under that old thing that was
an early poor approximation of mount namespaces that we don't need
anymore.

Eric

Serge E. Hallyn

unread,
Oct 7, 2013, 12:30:02 AM10/7/13
to
Quoting Eric W. Biederman (ebie...@xmission.com):
>

Serge E. Hallyn

unread,
Oct 7, 2013, 12:40:01 AM10/7/13
to
Quoting Eric W. Biederman (ebie...@xmission.com):
>
> Programs have been known to test for empty directories by attempting
> to remove them. To keep from violating the principle of least
> surprise don't let directories the caller can see with someting
> mounted on them be deleted.

Do you think we should do the same thing for over-mounted file at
vfs_unlink()?

Eric W. Biederman

unread,
Oct 7, 2013, 2:20:01 AM10/7/13
to
Good catch. Thank you.

Eric

Eric W. Biederman

unread,
Oct 7, 2013, 3:00:04 AM10/7/13
to
"Serge E. Hallyn" <se...@hallyn.com> writes:

> Quoting Eric W. Biederman (ebie...@xmission.com):
>>
>> Programs have been known to test for empty directories by attempting
>> to remove them. To keep from violating the principle of least
>> surprise don't let directories the caller can see with someting
>> mounted on them be deleted.
>
> Do you think we should do the same thing for over-mounted file at
> vfs_unlink()?

We easily could.

The point of the patch is to just preserve the directory is empty don't
allow rmdir to succeed semantics, and as typically we can see something
in the directory because of the mount it doesn't make sense for rmdir to
succeed.

unlink doesn't have any occassions when the permissions are sufficient
to remove a directory where it will fail. So I don't see the point of
doing this for anything except directories.

Except for possibly the oddball rmdir semantics mentioned I don't think
this patch should be part of anyone's correctness analysis.



It is easiest to see that this series of changes is semantically safe if
we are safe to run unprivileged code in a mount namespace where root has
locally unmounted every mount point.

We do have the restriction that in a user namespace we can't unmount
anything root was mounted outside the user namespace. Which combined
with the above patch would be roughly equivalent to todays mount
restrictions for the common case. Unfortunately being only roughly
equivalent the analysis gets very complicated, and complicated reasoning
usually means invalid reasoning.


So if we can feel safe just depending on the parent directory
permissions (which are not hidden by a mount) protecting our mount
points, I feel much better about this patchset.



But if you can articulate some reasons why it would be better and less
surprising for unlink to fail I am willing to listen.

Andy Lutomirski

unread,
Oct 7, 2013, 1:00:03 PM10/7/13
to
I feel safe about this (in the safe-against-attack sense, not
necessary the safe-against-confused-users sense) because the whole
point of preventing unmount of things that were mounted by root is to
prevent untrusted users from seeing what's under the mount. But rmdir
and unlink don't let you see what was under the mount because they
remove it.

(Hmm. I wonder if rename on a mountpoint could work. I've wanted
this on occasion.)

--Andy

Eric W. Biederman

unread,
Oct 7, 2013, 6:30:01 PM10/7/13
to
My premise is that the directory permissions of the directory that
contains the mountpoint are already restrictive enough that it safe to
rely on them for permission to unlink and thus unmount.

I believe my premise is correct however I am human and make mistakes.

As far as I can tell a world writable / or /dev has a lot of other potential
issues and similarly with other directories where we put mount points,
so people just won't create those directories with bad permissions.

I am reitterating just so that what I depend upon is clear and if anyone
knows a counter example they can tell me.

I think by preventing rmdir (when the mount is in the same mount
namespace) and allowing rename and unlink (that do not unlink a
directory) we present a minimal surprise to users.

> (Hmm. I wonder if rename on a mountpoint could work. I've wanted
> this on occasion.)

With this set of patches rename on a mountpoint is supported. That will
break /etc/mtab but in the cases where it really matters renaming a
mount point will break other things. Which makes in most cases renaming
a mount point a case of don't do that then, while still being well
defined for the cases where it is not a problem.

Eric

Karel Zak

unread,
Oct 8, 2013, 4:10:02 AM10/8/13
to
On Sat, Oct 05, 2013 at 06:42:44PM -0500, Rob Landley wrote:
> Oh, attached is a dumb "zapchroot" script I've been using for years to
> unlink all mount points under a given directory, taking advantage of the
> fact that mount points are appended to the end of the list so if you unlink
> from the end to the front you should get the sub-mounts before the parent
> mounts (modulo mount --move not reordering the list, but that's uncommon).

util-linux umount supports --recursive, it uses /proc/self/mountinfo
to compose the hierarchy. The important is that the mountinfo file
contains Id and Parent_Id relations, so you don't rely on the order
only.

> Recently I noticed some kernels where chroot does _not_ trim the paths so
> that the paths you see in /proc/mounts are relevant to the current chroot
> but instead have all sorts of crap you can't access with no way to know what
> it's talking about. That was sad, I need to go figure out if that was distro
> breakage or vanilla breakage...

hmm..

Karel


--
Karel Zak <kz...@redhat.com>
http://karelzak.blogspot.com

Matthias Schniedermeyer

unread,
Oct 8, 2013, 6:50:02 AM10/8/13
to
On 06.10.2013 23:55, Eric W. Biederman wrote:
> "Serge E. Hallyn" <se...@hallyn.com> writes:
>
> So if we can feel safe just depending on the parent directory
> permissions (which are not hidden by a mount) protecting our mount
> points, I feel much better about this patchset.

As far as i can tell, the permissions of the host-directory of a
mount-point are hidden, at least for user-space.
(Ignoring (bind-)mounting the parent-mount somewhere else)

As root:
$ mkdir /tmp/test
$ ls -ld /tmp/test
drwxr-xr-x 2 root root 40 Oct 8 12:33 /tmp/test

$ mount tmpfs -t tmpfs /tmp/test
$ ls -ld /tmp/test
drwxrwxrwt 2 root root 40 Oct 8 12:33 /tmp/test

$ chown nobody.users /tmp/test
$ ls -ld /tmp/test
drwxrwxrwt 2 nobody users 40 Oct 8 12:33 /tmp/test

$ umount /tmp/test
$ ls -ld /tmp/test
drwxr-xr-x 2 root root 40 Oct 8 12:33 /tmp/test


So if the kernel would check the host-directory-permissions for allowing
umounting by rmdir it follows that a "plain user" doesn't have any
possibility to know beforehand if rmdir/umount would be possible.



--

Matthias

Eric W. Biederman

unread,
Oct 8, 2013, 9:20:02 AM10/8/13
to
Except the directory that is relevant to unlink/rmdir in your example is /tmp
not /tmp/test.

But thanks for the eyeball.

Eric

Miklos Szeredi

unread,
Oct 8, 2013, 9:50:03 AM10/8/13
to
On Fri, Oct 04, 2013 at 03:42:37PM -0700, Eric W. Biederman wrote:
>
> Signed-off-by: Eric W. Biederman <ebied...@twitter.com>
> ---
> fs/mount.h | 2 ++
> fs/namespace.c | 5 +++++
> 2 files changed, 7 insertions(+), 0 deletions(-)
>
> diff --git a/fs/mount.h b/fs/mount.h
> index 64a858143ff9..e4342b8dfab1 100644
> --- a/fs/mount.h
> +++ b/fs/mount.h
> @@ -21,6 +21,7 @@ struct mnt_pcp {
> struct mountpoint {
> struct list_head m_hash;
> struct dentry *m_dentry;
> + struct list_head m_list;
> int m_count;
> };
>
> @@ -47,6 +48,7 @@ struct mount {
> struct mount *mnt_master; /* slave is on master->mnt_slave_list */
> struct mnt_namespace *mnt_ns; /* containing namespace */
> struct mountpoint *mnt_mp; /* where is it mounted */
> + struct list_head mnt_mp_list; /* list mounts with the same mountpoint */
> #ifdef CONFIG_FSNOTIFY
> struct hlist_head mnt_fsnotify_marks;
> __u32 mnt_fsnotify_mask;
> diff --git a/fs/namespace.c b/fs/namespace.c
> index da5c49483430..d092964fe7f9 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -197,6 +197,7 @@ static struct mount *alloc_vfsmnt(const char *name)
> INIT_LIST_HEAD(&mnt->mnt_share);
> INIT_LIST_HEAD(&mnt->mnt_slave_list);
> INIT_LIST_HEAD(&mnt->mnt_slave);
> + INIT_LIST_HEAD(&mnt->mnt_mp_list);
> #ifdef CONFIG_FSNOTIFY
> INIT_HLIST_HEAD(&mnt->mnt_fsnotify_marks);
> #endif
> @@ -636,6 +637,7 @@ static struct mountpoint *new_mountpoint(struct dentry *dentry)
> mp->m_dentry = dentry;
> mp->m_count = 1;
> list_add(&mp->m_hash, chain);
> + INIT_LIST_HEAD(&mp->m_list);
> return mp;
> }
>
> @@ -691,6 +693,7 @@ static void detach_mnt(struct mount *mnt, struct path *old_path)
> list_del_init(&mnt->mnt_hash);
> put_mountpoint(mnt->mnt_mp);
> mnt->mnt_mp = NULL;
> + list_del_init(&mnt->mnt_mp_list);

Should be done *before* put_mountpoint(), for obvious reasons.

And a BUG_ON(!list_empty(&mp->m_list)) in put_mountpoint() for good measure (and
no, WARN_ON() is not better here, since use-after-free is definitely worse than
a BUG).

> }
>
> /*
> @@ -705,6 +708,7 @@ void mnt_set_mountpoint(struct mount *mnt,
> child_mnt->mnt_mountpoint = dget(mp->m_dentry);
> child_mnt->mnt_parent = mnt;
> child_mnt->mnt_mp = mp;
> + list_add_tail(&child_mnt->mnt_mp_list, &mp->m_list);
> }
>
> /*
> @@ -1193,6 +1197,7 @@ void umount_tree(struct mount *mnt, int propagate)
> p->mnt_parent->mnt_ghosts++;
> put_mountpoint(p->mnt_mp);
> p->mnt_mp = NULL;
> + list_del_init(&mnt->mnt_mp_list);

Ditto.

> }
> change_mnt_propagation(p, MS_PRIVATE);
> }
> --
> 1.7.5.4

Miklos Szeredi

unread,
Oct 8, 2013, 10:00:02 AM10/8/13
to
On Fri, Oct 04, 2013 at 03:43:18PM -0700, Eric W. Biederman wrote:
>
> Signed-off-by: Eric W. Biederman <ebied...@twitter.com>
> ---
> fs/mount.h | 1 +
> fs/namespace.c | 24 ++++++++++++++++++++++++
> 2 files changed, 25 insertions(+), 0 deletions(-)
>
> diff --git a/fs/mount.h b/fs/mount.h
> index e4342b8dfab1..7a6a2bb3f290 100644
> --- a/fs/mount.h
> +++ b/fs/mount.h
> @@ -79,6 +79,7 @@ static inline int is_mounted(struct vfsmount *mnt)
> }
>
> extern struct mount *__lookup_mnt(struct vfsmount *, struct dentry *, int);
> +extern void detach_mounts(struct dentry *dentry);
>
> static inline void get_mnt_ns(struct mnt_namespace *ns)
> {
> diff --git a/fs/namespace.c b/fs/namespace.c
> index d092964fe7f9..8eaee0c14fdb 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1294,6 +1294,30 @@ static int do_umount(struct mount *mnt, int flags)
> return retval;
> }
>
> +void detach_mounts(struct dentry *dentry)
> +{
> + struct mount *mnt, *next;
> + struct mountpoint *mp;
> +
> + namespace_lock();
> + if (!d_mountpoint(dentry)) {
> + namespace_unlock();
> + return;
> + }
> + mp = new_mountpoint(dentry);
> + if (IS_ERR(mp)) {
> + return;
> + }
> + br_write_lock(&vfsmount_lock);
> + list_for_each_entry_safe(mnt, next, &mp->m_list, mnt_mp_list) {

I don't think list_for_each_entry_safe is actually safe enough here. Because
propageted umounts will likely remove other mounts from the same mountpoint as
well. Doing it with "while (!list_empty(&mp->m_list))" should be better, and
AFAICS it will always make progress.


> + if (!list_empty(&mnt->mnt_list))
> + umount_tree(mnt, 1);
> + }
> + br_write_unlock(&vfsmount_lock);
> + put_mountpoint(mp);
> + namespace_unlock();
> +}
> +
> /*
> * Is the caller allowed to modify his namespace?
> */

Miklos Szeredi

unread,
Oct 8, 2013, 12:00:02 PM10/8/13
to
On Fri, Oct 04, 2013 at 03:43:56PM -0700, Eric W. Biederman wrote:
>
> With the introduction of mount namespaces and bind mounts it because
> possible to access files and directories that in other locations in
> were used as mount points. Especially with mount namespaces has
> become very confusing why rm -rf somedir return -EBUSY because some
> directory is mounted somewhere else. With the addition of user
> namespaces allowing unprivileged mounts this condition has gone from
> annoying to allowing a DOS attack on more privileged users.
>
> The simplest approach appears to be to remove the -EBUSY message,
> allow unlink and rename, and lazily unmount the mount point.
>
> In most cases this is less surprising as this is an implementation
> of the normal unix behavior of allowing unlinking of files.
>
> The change implemented in this patch allows the following to succeed:
>
> The vfs does not currently follow paths up to the final component for
> the rename and unlink system calls making the boldest version of this
> idea the simplest to implement. Which should it simple to spot problems
> with this idea.
>
> While different from our historical behavior this change does not look
> like it will break anything, or introduce any security
> vulnerabilities. In a quick survey of all of the common mount points
> on linux systems I found mount points in directories owned and
> modifiable by root, and fuse fuse mounts in directories owned by the
> ``mounter'' of the fuse filesystem. In both of these cases relying on
> the permissions of the directory does not practically change the user
> who is allowed to unmount the filesystem.
>
> Attempting to anticipate cases I have not witnessed I observe that
> every directory in a trusted path to a file must limit modification
> such that no one else may modify that directory. For files trusted by
> suid root executables root most own and be the only user capable of
> modifying the directory and all parent directories for the files to be
> safe. Therefore for mount points part of a trusted path only root
> should be able to unlink any directory or file on that path. Which
> means after this change for a secured path only root can unmount
> directories.
>
> For mount points part of a path we can not trust we should not care if
> the just disappear, as that is just another kind of arbitrary
> manipulation.
>
> So I conclude that the existing conditions will ensure that the permissions
> on directories will be sufficiently limited that the new unmount on unlink
> behavior will not cause problems.
>
> Signed-off-by: "Eric W. Biederman" <ebie...@xmission.com>
> ---
> fs/afs/dir.c | 3 +-
> fs/dcache.c | 80 ++++++++++++++++++++----------------------------
> fs/fuse/dir.c | 3 +-
> fs/gfs2/dentry.c | 4 +--
> fs/namei.c | 31 ++++++------------
> fs/nfs/dir.c | 5 +--
> fs/sysfs/dir.c | 9 +-----
> include/linux/dcache.h | 3 +-
> 8 files changed, 51 insertions(+), 87 deletions(-)
>
> diff --git a/fs/afs/dir.c b/fs/afs/dir.c
> index 646337dc5201..7fb69d45f1b9 100644
> --- a/fs/afs/dir.c
> +++ b/fs/afs/dir.c
> @@ -686,8 +686,7 @@ not_found:
>
> out_bad:
> /* don't unhash if we have submounts */
> - if (check_submounts_and_drop(dentry) != 0)
> - goto out_skip;
> + shrink_submounts_and_drop(dentry);
>
> _debug("dropping dentry %s/%s",
> parent->d_name.name, dentry->d_name.name);
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 41000305d716..1e9bf96b0132 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -1373,7 +1373,7 @@ int d_set_mounted(struct dentry *dentry)
> int ret = -ENOENT;
> write_seqlock(&rename_lock);
> for (p = dentry->d_parent; !IS_ROOT(p); p = p->d_parent) {
> - /* Need exclusion wrt. check_submounts_and_drop() */
> + /* Need exclusion wrt. shrink_submounts_and_drop() */
> spin_lock(&p->d_lock);
> if (unlikely(d_unhashed(p))) {
> spin_unlock(&p->d_lock);
> @@ -1478,70 +1478,56 @@ void shrink_dcache_parent(struct dentry *parent)
> }
> EXPORT_SYMBOL(shrink_dcache_parent);
>
> -static enum d_walk_ret check_and_collect(void *_data, struct dentry *dentry)
> +struct detach_data {
> + struct dentry *found;
> +};
> +static enum d_walk_ret do_detach_submounts(void *ptr, struct dentry *dentry)
> {
> - struct select_data *data = _data;
> -
> - if (d_mountpoint(dentry)) {
> - data->found = -EBUSY;
> - return D_WALK_QUIT;
> - }
> -
> - return select_collect(_data, dentry);
> -}
> + struct detach_data *data = ptr;
>
> -static void check_and_drop(void *_data)
> -{
> - struct select_data *data = _data;
> + if (d_mountpoint(dentry))
> + data->found = dentry;
>
> - if (d_mountpoint(data->start))
> - data->found = -EBUSY;
> - if (!data->found)
> - __d_drop(data->start);
> + return data->found ? D_WALK_QUIT : D_WALK_CONTINUE;
> }
>
> /**
> - * check_submounts_and_drop - prune dcache, check for submounts and drop
> + * detach_submounts - check for submounts and detach them.
> *
> - * All done as a single atomic operation relative to has_unlinked_ancestor().
> - * Returns 0 if successfully unhashed @parent. If there were submounts then
> - * return -EBUSY.
> + * @dentry: dentry to find mount points under.
> *
> - * @dentry: dentry to prune and drop
> + * If dentry or any of it's children is a mount point detach those mounts.
> */
> -int check_submounts_and_drop(struct dentry *dentry)
> +void detach_submounts(struct dentry *dentry)
> {
> - int ret = 0;
> -
> - /* Negative dentries can be dropped without further checks */
> - if (!dentry->d_inode) {
> - d_drop(dentry);
> - goto out;
> - }
> -
> + struct detach_data data;
> for (;;) {
> - struct select_data data;
> -
> - INIT_LIST_HEAD(&data.dispose);
> - data.start = dentry;
> - data.found = 0;
> + data.found = NULL;
> + d_walk(dentry, &data, do_detach_submounts, NULL);
>
> - d_walk(dentry, &data, check_and_collect, check_and_drop);
> - ret = data.found;
> -
> - if (!list_empty(&data.dispose))
> - shrink_dentry_list(&data.dispose);
> -
> - if (ret <= 0)
> + if (!data.found)
> break;
>
> + detach_mounts(data.found);
> cond_resched();
> }
> + detach_mounts(dentry);
> +}
>
> -out:
> - return ret;
> +/**
> + * shrink_submounts_and_drop - detach submounts, prune dcache, and drop
> + *
> + * All done as a single atomic operation reletaive to d_set_mounted().
> + *
> + * @dentry: dentry to detach, prune and drop
> + */
> +void shrink_submounts_and_drop(struct dentry *dentry)
> +{
> + d_drop(dentry);
> + detach_submounts(dentry);

And here, between detach_submounts() and shrink_dcache_parent() a new mount can
be added.

It's not accidental that check_submounts_and_drop() did the check and the drop
together, protected by rename_lock and d_lock.

> + shrink_dcache_parent(dentry);
> }
> -EXPORT_SYMBOL(check_submounts_and_drop);
> +EXPORT_SYMBOL(shrink_submounts_and_drop);
>
> /**
> * __d_alloc - allocate a dcache entry
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 62b43b577bfc..b1cd7b79a325 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -259,8 +259,7 @@ out:
>
> invalid:
> ret = 0;
> - if (check_submounts_and_drop(entry) != 0)
> - ret = 1;
> + shrink_submounts_and_drop(entry);
> goto out;
> }
>
> diff --git a/fs/gfs2/dentry.c b/fs/gfs2/dentry.c
> index d3a5d4e29ba5..2ecc2b873829 100644
> --- a/fs/gfs2/dentry.c
> +++ b/fs/gfs2/dentry.c
> @@ -93,9 +93,7 @@ invalid_gunlock:
> if (!had_lock)
> gfs2_glock_dq_uninit(&d_gh);
> invalid:
> - if (check_submounts_and_drop(dentry) != 0)
> - goto valid;
> -
> + shrink_submounts_and_drop(dentry);
> dput(parent);
> return 0;
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 645268f23eb6..b18b017c946b 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3560,10 +3560,6 @@ int vfs_rmdir(struct inode *dir, struct dentry *dentry)
> dget(dentry);
> mutex_lock(&dentry->d_inode->i_mutex);
>
> - error = -EBUSY;
> - if (d_mountpoint(dentry))
> - goto out;
> -
> error = security_inode_rmdir(dir, dentry);
> if (error)
> goto out;
> @@ -3575,6 +3571,7 @@ int vfs_rmdir(struct inode *dir, struct dentry *dentry)
>
> dentry->d_inode->i_flags |= S_DEAD;
> dont_mount(dentry);
> + detach_mounts(dentry);
>
> out:
> mutex_unlock(&dentry->d_inode->i_mutex);
> @@ -3657,14 +3654,12 @@ int vfs_unlink(struct inode *dir, struct dentry *dentry)
> return -EPERM;
>
> mutex_lock(&dentry->d_inode->i_mutex);
> - if (d_mountpoint(dentry))
> - error = -EBUSY;
> - else {
> - error = security_inode_unlink(dir, dentry);
> + error = security_inode_unlink(dir, dentry);
> + if (!error) {
> + error = dir->i_op->unlink(dir, dentry);
> if (!error) {
> - error = dir->i_op->unlink(dir, dentry);
> - if (!error)
> - dont_mount(dentry);
> + dont_mount(dentry);
> + detach_mounts(dentry);
> }
> }
> mutex_unlock(&dentry->d_inode->i_mutex);
> @@ -3988,10 +3983,6 @@ static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
> if (target)
> mutex_lock(&target->i_mutex);
>
> - error = -EBUSY;
> - if (d_mountpoint(old_dentry) || d_mountpoint(new_dentry))
> - goto out;
> -

I know of at least one app that relied at some point on a mountpoint (directory
or non-directory) not being movable: fusermount uses this to ensure that
unprivileged userspace didn't try replacing a fuse mount with a symlink to trick
fusermount into umounting an arbitrary path. The code that relied on this was
replaced by UMOUNT_NOFOLLOW on kernels where it is supported. But in theory
there may exist a running binary without UMOUNT_NOFOLLOW and relying on EBUSY.

And there may be other such horrid hacks out there.

> error = -EMLINK;
> if (max_links && !target && new_dir != old_dir &&
> new_dir->i_nlink >= max_links)
> @@ -4006,6 +3997,7 @@ static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
> if (target) {
> target->i_flags |= S_DEAD;
> dont_mount(new_dentry);
> + detach_mounts(new_dentry);
> }
> out:
> if (target)
> @@ -4031,16 +4023,15 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
> if (target)
> mutex_lock(&target->i_mutex);
>
> - error = -EBUSY;
> - if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
> - goto out;
> -
> error = old_dir->i_op->rename(old_dir, old_dentry, new_dir, new_dentry);
> if (error)
> goto out;
>
> - if (target)
> + if (target) {
> dont_mount(new_dentry);
> + detach_mounts(new_dentry);
> + }
> + detach_mounts(old_dentry);

Why exactly? "Moved file changes contents" is not the least surprising result,
IMO. And why the difference between rename-dir and rename-other in this regard?

> if (!(old_dir->i_sb->s_type->fs_flags & FS_RENAME_DOES_D_MOVE))
> d_move(old_dentry, new_dentry);
> out:
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index 854a8f05a610..e8e35acd8850 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -1142,10 +1142,7 @@ out_zap_parent:
> if (dentry->d_flags & DCACHE_DISCONNECTED)
> goto out_valid;
> }
> - /* If we have submounts, don't unhash ! */
> - if (check_submounts_and_drop(dentry) != 0)
> - goto out_valid;
> -
> + shrink_submounts_and_drop(dentry);
> dput(parent);
> dfprintk(LOOKUPCACHE, "NFS: %s(%s/%s) is invalid\n",
> __func__, dentry->d_parent->d_name.name,
> diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
> index 4d83cedb9fcb..477c66d4e2a8 100644
> --- a/fs/sysfs/dir.c
> +++ b/fs/sysfs/dir.c
> @@ -327,7 +327,6 @@ static int sysfs_dentry_revalidate(struct dentry *dentry, unsigned int flags)
> }
>
> mutex_unlock(&sysfs_mutex);
> -out_valid:
> return 1;
> out_bad:
> /* Remove the dentry from the dcache hashes.
> @@ -341,13 +340,7 @@ out_bad:
> * to the dcache hashes.
> */
> mutex_unlock(&sysfs_mutex);
> -
> - /* If we have submounts we must allow the vfs caches
> - * to lie about the state of the filesystem to prevent
> - * leaks and other nasty things.
> - */
> - if (check_submounts_and_drop(dentry) != 0)
> - goto out_valid;
> + shrink_submounts_and_drop(dentry);
>
> return 0;
> }
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index 59066e0b4ff1..17948b49f3d5 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -254,7 +254,8 @@ extern void d_prune_aliases(struct inode *);
>
> /* test whether we have any submounts in a subdir tree */
> extern int have_submounts(struct dentry *);
> -extern int check_submounts_and_drop(struct dentry *);
> +extern void detach_submounts(struct dentry *dentry);
> +extern void shrink_submounts_and_drop(struct dentry *);
>
> /*
> * This adds the entry to the hash queues.

Andy Lutomirski

unread,
Oct 8, 2013, 12:10:02 PM10/8/13
to
On Tue, Oct 8, 2013 at 9:06 AM, Miklos Szeredi <mik...@szeredi.hu> wrote:
> On Mon, Oct 07, 2013 at 03:25:17PM -0700, Eric W. Biederman wrote:
>> Andy Lutomirski <lu...@amacapital.net> writes:
>>
>> > On Mon, Oct 7, 2013 at 7:55 AM, Eric W. Biederman <ebie...@xmission.com> wrote:
>> >> "Serge E. Hallyn" <se...@hallyn.com> writes:
>> >>
>> >>> Quoting Eric W. Biederman (ebie...@xmission.com):
>> >>>>
>> >>>> Programs have been known to test for empty directories by attempting
>> >>>> to remove them. To keep from violating the principle of least
>> >>>> surprise don't let directories the caller can see with someting
>> >>>> mounted on them be deleted.
>> >>>
>> >>> Do you think we should do the same thing for over-mounted file at
>> >>> vfs_unlink()?
>> >>
>> >> We easily could.
>
> And for vfs_rename()?
>
> I think the risks of changing behavior outweigh the benefits. How many times
> did you have to remove or rename a mounted file or directory? It's very rarely
> needed.

I do this every time I reinstall a system while running that system.
Admittedly, mount --move works, but that's a really unpleasant
interface.

When rename2 gets added, there could be a flag RENAME_MOVE_MOUNT to opt in.

--Andy

>
> Thanks,
> Miklos



--
Andy Lutomirski
AMA Capital Management, LLC

Miklos Szeredi

unread,
Oct 8, 2013, 12:10:04 PM10/8/13
to
On Mon, Oct 07, 2013 at 03:25:17PM -0700, Eric W. Biederman wrote:
> Andy Lutomirski <lu...@amacapital.net> writes:
>
> > On Mon, Oct 7, 2013 at 7:55 AM, Eric W. Biederman <ebie...@xmission.com> wrote:
> >> "Serge E. Hallyn" <se...@hallyn.com> writes:
> >>
> >>> Quoting Eric W. Biederman (ebie...@xmission.com):
> >>>>
> >>>> Programs have been known to test for empty directories by attempting
> >>>> to remove them. To keep from violating the principle of least
> >>>> surprise don't let directories the caller can see with someting
> >>>> mounted on them be deleted.
> >>>
> >>> Do you think we should do the same thing for over-mounted file at
> >>> vfs_unlink()?
> >>
> >> We easily could.

And for vfs_rename()?

I think the risks of changing behavior outweigh the benefits. How many times
did you have to remove or rename a mounted file or directory? It's very rarely
needed.

Thanks,
Miklos

Miklos Szeredi

unread,
Oct 8, 2013, 12:20:02 PM10/8/13
to
On Tue, Oct 08, 2013 at 09:06:29AM -0700, Andy Lutomirski wrote:

> > I think the risks of changing behavior outweigh the benefits. How many
> > times did you have to remove or rename a mounted file or directory? It's
> > very rarely needed.
>
> I do this every time I reinstall a system while running that system.
> Admittedly, mount --move works, but that's a really unpleasant
> interface.
>
> When rename2 gets added, there could be a flag RENAME_MOVE_MOUNT to opt in.

Good point. Opting in would be the safest for both unlinkat() and
renameat2().

Thanks,
Miklos

Eric W. Biederman

unread,
Oct 8, 2013, 4:40:01 PM10/8/13
to
Miklos Szeredi <mik...@szeredi.hu> writes:

> On Fri, Oct 04, 2013 at 03:43:18PM -0700, Eric W. Biederman wrote:
>>
>> +void detach_mounts(struct dentry *dentry)
>> +{
>> + struct mount *mnt, *next;
>> + struct mountpoint *mp;
>> +
>> + namespace_lock();
>> + if (!d_mountpoint(dentry)) {
>> + namespace_unlock();
>> + return;
>> + }
>> + mp = new_mountpoint(dentry);
>> + if (IS_ERR(mp)) {
>> + return;
>> + }
>> + br_write_lock(&vfsmount_lock);
>> + list_for_each_entry_safe(mnt, next, &mp->m_list, mnt_mp_list) {
>
> I don't think list_for_each_entry_safe is actually safe enough here. Because
> propageted umounts will likely remove other mounts from the same mountpoint as
> well. Doing it with "while (!list_empty(&mp->m_list))" should be better, and
> AFAICS it will always make progress.

Thanks I will take a look. But on the surface that logic sounds
correct.

Eric

Eric W. Biederman

unread,
Oct 8, 2013, 4:40:02 PM10/8/13
to
Miklos Szeredi <mik...@szeredi.hu> writes:

>> @@ -691,6 +693,7 @@ static void detach_mnt(struct mount *mnt, struct path *old_path)
>> list_del_init(&mnt->mnt_hash);
>> put_mountpoint(mnt->mnt_mp);
>> mnt->mnt_mp = NULL;
>> + list_del_init(&mnt->mnt_mp_list);
>
> Should be done *before* put_mountpoint(), for obvious reasons.
>
> And a BUG_ON(!list_empty(&mp->m_list)) in put_mountpoint() for good measure (and
> no, WARN_ON() is not better here, since use-after-free is definitely worse than
> a BUG).

Good point. I will take a look at making this change shortly and repost
the patches.

Eric W. Biederman

unread,
Oct 8, 2013, 5:00:03 PM10/8/13
to
Miklos Szeredi <mik...@szeredi.hu> writes:

> On Tue, Oct 08, 2013 at 09:06:29AM -0700, Andy Lutomirski wrote:
>
>> > I think the risks of changing behavior outweigh the benefits. How many
>> > times did you have to remove or rename a mounted file or directory? It's
>> > very rarely needed.
>>
>> I do this every time I reinstall a system while running that system.
>> Admittedly, mount --move works, but that's a really unpleasant
>> interface.
>>
>> When rename2 gets added, there could be a flag RENAME_MOVE_MOUNT to opt in.
>
> Good point. Opting in would be the safest for both unlinkat() and
> renameat2().

Opting in to allowing mounts to be unlinked and renamed? Or opting in
to not renaming/unlinking a mount point?

I can see opting in to denying the rename/unlink if that is someting
someone wants for some reason. With an opt in we can even use the
previous d_mountpoint check and not stomp someone's else's weird set of
mounts in some other mount namespace if they are on the local machine
and that is desired.



The rmdir non-empty dir semantics justify blocking rmdir.



If we are going to fix the VFS deficiency we have to let these changes
happen in other mount namespaces. To make that safe it has to be
sufficient to rely on the directory permissions and the conditions that
ensure that the directory permissions are sufficient.

So I find it far safer to allow as much as possible even in the local
mount namespace so we can actually see if there are problems with
relying on the directory permissions.

Eric

Eric W. Biederman

unread,
Oct 8, 2013, 5:50:02 PM10/8/13
to
Miklos Szeredi <mik...@szeredi.hu> writes:

> On Fri, Oct 04, 2013 at 03:43:56PM -0700, Eric W. Biederman wrote:
>> +/**
>> + * shrink_submounts_and_drop - detach submounts, prune dcache, and drop
>> + *
>> + * All done as a single atomic operation reletaive to d_set_mounted().
>> + *
>> + * @dentry: dentry to detach, prune and drop
>> + */
>> +void shrink_submounts_and_drop(struct dentry *dentry)
>> +{
>> + d_drop(dentry);
>> + detach_submounts(dentry);
>
> And here, between detach_submounts() and shrink_dcache_parent() a new mount can
> be added.

Actually it is not possible to add a new mount betwee detach_submounts
and shrink_dcache_parent d_set_mountpoint will see the unhashed
dentry as an unhashed parent and refuse to add any new mount points.

> It's not accidental that check_submounts_and_drop() did the check and the drop
> together, protected by rename_lock and d_lock.

When I looked the locking of detach_mount prevented me from using the
same technique as check_submounts_and_drop. But happily
d_set_mountpoint now checks for unhashed parents and prevents carnage at
that point.

>> + shrink_dcache_parent(dentry);
>> }
>> -EXPORT_SYMBOL(check_submounts_and_drop);
>> +EXPORT_SYMBOL(shrink_submounts_and_drop);
>>

>> @@ -3988,10 +3983,6 @@ static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
>> if (target)
>> mutex_lock(&target->i_mutex);
>>
>> - error = -EBUSY;
>> - if (d_mountpoint(old_dentry) || d_mountpoint(new_dentry))
>> - goto out;
>> -
>
> I know of at least one app that relied at some point on a mountpoint (directory
> or non-directory) not being movable: fusermount uses this to ensure that
> unprivileged userspace didn't try replacing a fuse mount with a symlink to trick
> fusermount into umounting an arbitrary path. The code that relied on this was
> replaced by UMOUNT_NOFOLLOW on kernels where it is supported. But in theory
> there may exist a running binary without UMOUNT_NOFOLLOW and relying on EBUSY.
>
> And there may be other such horrid hacks out there.

That is certainly worth thinking about.

In practice if I rename a parent directory I can rename mount points
today. So I don't know that it was ever fully safe to rely on rename
prevention.

We may be able to keep this rule just for the local mount namespace if
that would help.

If we are going to remove lying to the VFS removing the d_mountpoint
restriction of rename is necessary.


It gets scary to rely on the checks only applying locally, because of
things like mount --bind /some/path /some/other/path, and because of
cases like sysfs and nfs where the mutation path is not through the
local vfs and simply bypasses all of these changes.

So I will think about this case some more but I don't think I can make
this change and remain hack compatile with old kernels.

I hope we don't find any horrible hacks that we absolutely must support
because at best that puts everything back to the drawing board.

But I will go through and read the old fusermount code before I get too
much farther just so I understand what I am potentially breaking.
Because detach_mounts(old_dentry) is a bug.

detach_mounts(new_dentry) is needed so that it is safe to put new_dentry
a few lines

Eric

Eric W. Biederman

unread,
Oct 9, 2013, 3:20:02 PM10/9/13
to
ebie...@xmission.com (Eric W. Biederman) writes:

> But I will go through and read the old fusermount code before I get too
> much farther just so I understand what I am potentially breaking.

Grr.

So I have just read the fusermount umount code and the hack that it uses
before there was UMOUNT_NOFOLLOW support in the vm.

If I walk this path of lazy unmounts and detaching directories, anyone
with a new kernel and an old copy of fusermount and a nfs mounted home
directory will be able to exploit the fusermount umount symlink race.

Unless we can declare that old fusermount binaries are buggy beyond
supporting this patchset as it exists is dead.

Andy Lutomirski

unread,
Oct 9, 2013, 4:10:02 PM10/9/13
to
On Wed, Oct 9, 2013 at 12:12 PM, Eric W. Biederman
<ebie...@xmission.com> wrote:
> ebie...@xmission.com (Eric W. Biederman) writes:
>
>> But I will go through and read the old fusermount code before I get too
>> much farther just so I understand what I am potentially breaking.
>
> Grr.
>
> So I have just read the fusermount umount code and the hack that it uses
> before there was UMOUNT_NOFOLLOW support in the vm.
>
> If I walk this path of lazy unmounts and detaching directories, anyone
> with a new kernel and an old copy of fusermount and a nfs mounted home
> directory will be able to exploit the fusermount umount symlink race.
>
> Unless we can declare that old fusermount binaries are buggy beyond
> supporting this patchset as it exists is dead.

What's the hack that it does?

--Andy

Eric W. Biederman

unread,
Oct 9, 2013, 7:40:02 PM10/9/13
to
Andy Lutomirski <lu...@amacapital.net> writes:

> On Wed, Oct 9, 2013 at 12:12 PM, Eric W. Biederman
> <ebie...@xmission.com> wrote:
>> ebie...@xmission.com (Eric W. Biederman) writes:
>>
>>> But I will go through and read the old fusermount code before I get too
>>> much farther just so I understand what I am potentially breaking.
>>
>> Grr.
>>
>> So I have just read the fusermount umount code and the hack that it uses
>> before there was UMOUNT_NOFOLLOW support in the vm.
>>
>> If I walk this path of lazy unmounts and detaching directories, anyone
>> with a new kernel and an old copy of fusermount and a nfs mounted home
>> directory will be able to exploit the fusermount umount symlink race.
>>
>> Unless we can declare that old fusermount binaries are buggy beyond
>> supporting this patchset as it exists is dead.
>
> What's the hack that it does?

The problem is that "umount -l /some/user/supplied/path" can unmount
anything.

Even after checking mtab etc there are races, and replacing "path" with
a symlink will still allow you to unmount anything.

So when not usering UMOUNT_NOFOLLOW fusermount will do:

cd "/some/user/supplied/"

clone(CLONE_NEWNS)
mount --make-rprivate /
COUNT=$(cat /proc/mounts | wc -l)
mount --rbind . /
Look at new /proc/mounts entries past $COUNT and see if there is an
entry for "path"
if $found
exit 0
else
exit 1
wait
if ($? == 0) umount -l "path"

As everything happens in the directory just below the mount point
we are only concerned about things that happen in that directory.

This does mean there is a window between checking that path is
a valid fuse mount point and when that mountpoint is unmounted.

So if someone can unlink/rename the mount point and replace it
with a symlink during that small window it is possible to unmount
anything.

Today the d_mountpoint tests and the requirement distributed filesystems
(aka nfs) lie to the vfs prevent those issues.

The UMOUNT_NOFOLLOW code which appears in fuse 2.9.0 is new enough that
distro's like CentOS6 don't ship it yet.

Eric

Rob Landley

unread,
Oct 10, 2013, 2:50:01 AM10/10/13
to
On 10/08/2013 03:03:03 AM, Karel Zak wrote:
> On Sat, Oct 05, 2013 at 06:42:44PM -0500, Rob Landley wrote:
> > Oh, attached is a dumb "zapchroot" script I've been using for years
> to
> > unlink all mount points under a given directory, taking advantage
> of the
> > fact that mount points are appended to the end of the list so if
> you unlink
> > from the end to the front you should get the sub-mounts before the
> parent
> > mounts (modulo mount --move not reordering the list, but that's
> uncommon).
>
> util-linux umount supports --recursive, it uses /proc/self/mountinfo
> to compose the hierarchy. The important is that the mountinfo file
> contains Id and Parent_Id relations, so you don't rely on the order
> only.

Ah, that's what happened.

For some reason /proc/self/mounts stopped adjusting itself for chroot a
while back, apparently because containers use switch_root instead so
chroot is generally deprecated or something? This made /proc/mounts
completely useless in a chroot because the paths it showed were not the
ones you actually had to use to umount anything.

Instead they added a new way to get this info, in a new format where
you have to parse the 5th field out of each line to get the mount point
(to make it friendly to scripts and tools).

Good to know.

Rob--

Miklos Szeredi

unread,
Oct 10, 2013, 6:10:02 AM10/10/13
to
On Tue, Oct 8, 2013 at 10:50 PM, Eric W. Biederman
<ebie...@xmission.com> wrote:

> If we are going to fix the VFS deficiency we have to let these changes
> happen in other mount namespaces. To make that safe it has to be
> sufficient to rely on the directory permissions and the conditions that
> ensure that the directory permissions are sufficient.

Yes.

> So I find it far safer to allow as much as possible even in the local
> mount namespace so we can actually see if there are problems with
> relying on the directory permissions.

But it's not safer in terms of breaking legacy environments, which are
single namespace, usually.

I see it this way: the current behavior (blocking unlink/rename if
mounted) may not be nice, but it's expected. But once we have
multiple namespaces it becomes problematic. Just like network
filesystems where we cannot prevent unlink on another host just
because it's mounted on this. So we have to have some way of dealing
with this.

Is it worth changing behavor in the "current namespace" case too? It
has advantages, but it's not absolutely necessary and there are
unknown dangers. But if we disable it for rmdir() then what's the
point in enabling it for unlink()? Directory mounts far outnumber
non-directory ones, so the additional testing is minimal, but the
holes left are probably just as real.

And there's rename(). We have a real security hole in fusermount by
allowing it in the local namespace. I don't think changing the
behavior of rename() in the local namespace is important enough to
risk such problems. And as Andy pointed out, we can just have an
option to turn it on and that's that.

Thanks,
Miklos

Eric W. Biederman

unread,
Oct 10, 2013, 7:50:02 AM10/10/13
to
I have been weighing the pros and the cons of this.

At this point the most practical path I can see is to block unlink,
rename, and rmdir if there is a mount in the local namespace.

At the very least that makes very limited additions to what applications
can depend on from the vfs, and even more importantly that can be ready
by 3.13.

Furthermore it removes the nasty need for filesystems to lie to the vfs,
and removes the nasty DOS that non-local mountpoints are.

Miklos if you as the fuse maintainer aren't worried about network
filesystems, and multiple namespaces I won't worry either. Especially
since modern versions of fuse aren't affected.

I have a implementable looking plan B if we really need it of creating
ghost dentries for mounts to ride on if their underlying dentry gets
renamed or unlinked. And I don't see this removing the possibility of
implementing that plan B. But ick. I think deciding we need plan B
would just enough more work that it would kill getting any change made
in a timely fashion. So I will take my little blue pill and ignore the
possible slings and arrows of outrageous fortune, and see what happens.

Eric

Miklos Szeredi

unread,
Oct 10, 2013, 8:00:02 AM10/10/13
to
On Thu, Oct 10, 2013 at 1:43 PM, Eric W. Biederman
<ebie...@xmission.com> wrote:

> I have been weighing the pros and the cons of this.
>
> At this point the most practical path I can see is to block unlink,
> rename, and rmdir if there is a mount in the local namespace.
>
> At the very least that makes very limited additions to what applications
> can depend on from the vfs, and even more importantly that can be ready
> by 3.13.
>
> Furthermore it removes the nasty need for filesystems to lie to the vfs,
> and removes the nasty DOS that non-local mountpoints are.

Agreed 100%.

> Miklos if you as the fuse maintainer aren't worried about network
> filesystems, and multiple namespaces I won't worry either. Especially
> since modern versions of fuse aren't affected.

I think the above conditions (local mount blocks unlink/rename) are
enough to prevent most of the problems, of which there aren't many in
any case.

Thanks,
Miklos

Eric W. Biederman

unread,
Oct 11, 2013, 9:10:02 PM10/11/13
to
Miklos Szeredi <mik...@szeredi.hu> writes:

> On Thu, Oct 10, 2013 at 1:43 PM, Eric W. Biederman
>> Miklos if you as the fuse maintainer aren't worried about network
>> filesystems, and multiple namespaces I won't worry either. Especially
>> since modern versions of fuse aren't affected.
>
> I think the above conditions (local mount blocks unlink/rename) are
> enough to prevent most of the problems, of which there aren't many in
> any case.

Dumb question.

What prevents someone setting up a race between the fusermount
permission checks and replacing the destination with a symlink, perhaps
to /etc/shadow?

Do we need a MS_NOFOLLOW?

Eric

Eric W. Biederman

unread,
Oct 11, 2013, 9:50:01 PM10/11/13
to
ebie...@xmission.com (Eric W. Biederman) writes:

> Miklos Szeredi <mik...@szeredi.hu> writes:
>
>> On Thu, Oct 10, 2013 at 1:43 PM, Eric W. Biederman
>>> Miklos if you as the fuse maintainer aren't worried about network
>>> filesystems, and multiple namespaces I won't worry either. Especially
>>> since modern versions of fuse aren't affected.
>>
>> I think the above conditions (local mount blocks unlink/rename) are
>> enough to prevent most of the problems, of which there aren't many in
>> any case.
>
> Dumb question.
>
> What prevents someone setting up a race between the fusermount
> permission checks and replacing the destination with a symlink, perhaps
> to /etc/shadow?
>
> Do we need a MS_NOFOLLOW?

Doh! mount(".",...) works just fine..

My apologies for the silly question.

Eric W. Biederman

unread,
Oct 15, 2013, 4:20:02 PM10/15/13
to

This patchset is an addresses two problems:
1) Not all modifications to the filesystems happen through the vfs and
since the vfs can not cope with a mount point being unlinked or
renamed filesystems whose modifications that do not come through the
vfs are required to lie.

2) Through an oversight it is now possible for one unprivileged user to
mount something on another unprivileged users dentry and make it
impossible for the other user to unlink or rename that dentry.

It is now technically possible to easily lift the restriction on
unlinking and renaming files with mount points on them, with a
corresponding reduction in complexity of the vfs semantics.

After review it seems that there are no objections to this approach as
long as we retain the -EBUSY semantics for rmdir, unlink, and rename of
mount points in the current mount namespace. The first patch in this
series now adds those local mount namespace restrictions.

All of the review comments should now be addressed and folded in, and
I have take a careful look and it appears what I have is now correct
and complete. So I am posting this for one last round of review.

Al if you want to take this through the vfs tree, point me at a branch
and I will give you versions of these patches that apply cleanly there.
Otherwise I will push these patches to my userns tree as soon as all of
these patches pass review.

Eric W. Biederman (4):
vfs: Don't allow overwriting mounts in the current mount namespace
vfs: Keep a list of mounts on a mount point
vfs: Add a function to lazily unmount all mounts from any dentry. v3
vfs: Lazily remove mounts on unlinked files and directories. v2

fs/afs/dir.c | 3 +-
fs/dcache.c | 80 ++++++++++++++++++++----------------------------
fs/fuse/dir.c | 3 +-
fs/gfs2/dentry.c | 4 +--
fs/mount.h | 3 ++
fs/namei.c | 55 +++++++++++++++++++++------------
fs/namespace.c | 30 ++++++++++++++++++
fs/nfs/dir.c | 5 +--
fs/sysfs/dir.c | 9 +-----
include/linux/dcache.h | 3 +-
10 files changed, 108 insertions(+), 87 deletions(-)

Eric W. Biederman

unread,
Oct 15, 2013, 4:20:02 PM10/15/13
to

v2: Always drop the lock when exiting early.
v3: Make detach_mounts robust about freeing several
mounts on the same mountpoint at one time, and remove
the unneeded mnt_list list test.

Signed-off-by: Eric W. Biederman <ebied...@twitter.com>
---
fs/mount.h | 1 +
fs/namespace.c | 24 ++++++++++++++++++++++++
2 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index e4342b8dfab1..7a6a2bb3f290 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -79,6 +79,7 @@ static inline int is_mounted(struct vfsmount *mnt)
}

extern struct mount *__lookup_mnt(struct vfsmount *, struct dentry *, int);
+extern void detach_mounts(struct dentry *dentry);

static inline void get_mnt_ns(struct mnt_namespace *ns)
{
diff --git a/fs/namespace.c b/fs/namespace.c
index e4fe22c23e00..78f7c5c9e673 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1295,6 +1295,30 @@ static int do_umount(struct mount *mnt, int flags)
return retval;
}

+void detach_mounts(struct dentry *dentry)
+{
+ struct mountpoint *mp;
+ struct mount *mnt;
+
+ namespace_lock();
+ if (!d_mountpoint(dentry))
+ goto out_unlock;
+
+ mp = new_mountpoint(dentry);
+ if (IS_ERR(mp))
+ goto out_unlock;
+
+ br_write_lock(&vfsmount_lock);
+ while (!list_empty(&mp->m_list)) {
+ mnt = list_first_entry(&mp->m_list, struct mount, mnt_mp_list);
+ umount_tree(mnt, 1);
+ }
+ br_write_unlock(&vfsmount_lock);
+ put_mountpoint(mp);
+out_unlock:
+ namespace_unlock();
+}
+
/*
* Is the caller allowed to modify his namespace?
*/
--
1.7.5.4

Eric W. Biederman

unread,
Oct 15, 2013, 4:20:02 PM10/15/13
to

With the introduction of mount namespaces and bind mounts it became
possible to access files and directories that on some paths are mount
points but are not mount points on other paths. It is very confusing
when rm -rf somedir returns -EBUSY simply because somedir is mounted
somewhere else. With the addition of user namespaces allowing
unprivileged mounts this condition has gone from annoying to allowing
a DOS attack on other users in the system.

The possibility for mischief is removed by updating the vfs to support
rename, unlink and rmdir on a dentry that is a mountpoint and by
lazily unmounting mountpoints on deleted dentries.

In particular this change allows rename, unlink and rmdir system calls
on a dentry without a mountpoint in the current mount namespace to
succeed, and it allows rename, unlink, and rmdir performed on a
distributed filesystem to update the vfs cache even if when there is a
mount in some namespace on the original dentry.

There are two common patterns of maintaining mounts: Mounts on trusted
paths with the parent directory of the mount point and all ancestory
directories up to / owned by root and modifiable only by root
(i.e. /media/xxx, /dev, /dev/pts, /proc, /sys, /sys/fs/cgroup/{cpu,
cpuacct, ...}, /usr, /usr/local). Mounts on unprivileged directories
maintained by fusermount.

In the case of mounts in trusted directories owned by root and
modifiable only by root the current parent directory permissions are
sufficient to ensure a mount point on a trusted path is not removed
or renamed by anyone other than root, even if there is a context
where the there are no mount points to prevent this.

In the case of mounts in directories owned by less privileged users
races with users modifying the path of a mount point are already a
danger. fusermount already uses a combination of chdir,
/proc/<pid>/fd/NNN, and UMOUNT_NOFOLLOW to prevent these races. The
removable of global rename, unlink, and rmdir protection really adds
nothing new to consider only a widening of the attack window, and
fusermount is already safe against unprivileged users modifying the
directory simultaneously.

In principle for perfect userspace programs returning -EBUSY for
unlink, rmdir, and rename of dentires that have mounts in the local
namespace is actually unnecessary. Unfortunately not all userspace
programs are perfect so retaining -EBUSY for unlink, rmdir and rename
of dentries that have mounts in the current mount namespace plays an
important role of maintaining consistency with historical behavior and
making imperfect userspace applications hard to exploit.

v2: Remove spurious old_dentry.

Signed-off-by: "Eric W. Biederman" <ebie...@xmission.com>
---
fs/afs/dir.c | 3 +-
fs/dcache.c | 80 ++++++++++++++++++++----------------------------
fs/fuse/dir.c | 3 +-
fs/gfs2/dentry.c | 4 +--
fs/namei.c | 30 ++++++------------
fs/nfs/dir.c | 5 +--
fs/sysfs/dir.c | 9 +-----
include/linux/dcache.h | 3 +-
8 files changed, 50 insertions(+), 87 deletions(-)
+/**
+ * shrink_submounts_and_drop - detach submounts, prune dcache, and drop
+ *
+ * All done as a single atomic operation reletaive to d_set_mounted().
+ *
+ * @dentry: dentry to detach, prune and drop
+ */
+void shrink_submounts_and_drop(struct dentry *dentry)
+{
+ d_drop(dentry);
+ detach_submounts(dentry);
index df7bd6e9c7b6..a12c1d31d4c8 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3574,10 +3574,6 @@ int vfs_rmdir(struct inode *dir, struct dentry *dentry)
dget(dentry);
mutex_lock(&dentry->d_inode->i_mutex);

- error = -EBUSY;
- if (d_mountpoint(dentry))
- goto out;
-
error = security_inode_rmdir(dir, dentry);
if (error)
goto out;
@@ -3589,6 +3585,7 @@ int vfs_rmdir(struct inode *dir, struct dentry *dentry)

dentry->d_inode->i_flags |= S_DEAD;
dont_mount(dentry);
+ detach_mounts(dentry);

out:
mutex_unlock(&dentry->d_inode->i_mutex);
@@ -3674,14 +3671,12 @@ int vfs_unlink(struct inode *dir, struct dentry *dentry)
return -EPERM;

mutex_lock(&dentry->d_inode->i_mutex);
- if (d_mountpoint(dentry))
- error = -EBUSY;
- else {
- error = security_inode_unlink(dir, dentry);
+ error = security_inode_unlink(dir, dentry);
+ if (!error) {
+ error = dir->i_op->unlink(dir, dentry);
if (!error) {
- error = dir->i_op->unlink(dir, dentry);
- if (!error)
- dont_mount(dentry);
+ dont_mount(dentry);
+ detach_mounts(dentry);
}
}
mutex_unlock(&dentry->d_inode->i_mutex);
@@ -4008,10 +4003,6 @@ static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
if (target)
mutex_lock(&target->i_mutex);

- error = -EBUSY;
- if (d_mountpoint(old_dentry) || d_mountpoint(new_dentry))
- goto out;
-
error = -EMLINK;
if (max_links && !target && new_dir != old_dir &&
new_dir->i_nlink >= max_links)
@@ -4026,6 +4017,7 @@ static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
if (target) {
target->i_flags |= S_DEAD;
dont_mount(new_dentry);
+ detach_mounts(new_dentry);
}
out:
if (target)
@@ -4051,16 +4043,14 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
if (target)
mutex_lock(&target->i_mutex);

- error = -EBUSY;
- if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
- goto out;
-
error = old_dir->i_op->rename(old_dir, old_dentry, new_dir, new_dentry);
if (error)
goto out;

- if (target)
+ if (target) {
dont_mount(new_dentry);
+ detach_mounts(new_dentry);
+ }

Eric W. Biederman

unread,
Oct 15, 2013, 4:20:04 PM10/15/13
to

To spot any possible problems call BUG if a mountpoint
is put when it's list of mounts is not empty.

Signed-off-by: Eric W. Biederman <ebied...@twitter.com>
---
fs/mount.h | 2 ++
fs/namespace.c | 6 ++++++
2 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index 64a858143ff9..e4342b8dfab1 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -21,6 +21,7 @@ struct mnt_pcp {
struct mountpoint {
struct list_head m_hash;
struct dentry *m_dentry;
+ struct list_head m_list;
int m_count;
};

@@ -47,6 +48,7 @@ struct mount {
struct mount *mnt_master; /* slave is on master->mnt_slave_list */
struct mnt_namespace *mnt_ns; /* containing namespace */
struct mountpoint *mnt_mp; /* where is it mounted */
+ struct list_head mnt_mp_list; /* list mounts with the same mountpoint */
#ifdef CONFIG_FSNOTIFY
struct hlist_head mnt_fsnotify_marks;
__u32 mnt_fsnotify_mask;
diff --git a/fs/namespace.c b/fs/namespace.c
index da5c49483430..e4fe22c23e00 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -197,6 +197,7 @@ static struct mount *alloc_vfsmnt(const char *name)
INIT_LIST_HEAD(&mnt->mnt_share);
INIT_LIST_HEAD(&mnt->mnt_slave_list);
INIT_LIST_HEAD(&mnt->mnt_slave);
+ INIT_LIST_HEAD(&mnt->mnt_mp_list);
#ifdef CONFIG_FSNOTIFY
INIT_HLIST_HEAD(&mnt->mnt_fsnotify_marks);
#endif
@@ -636,6 +637,7 @@ static struct mountpoint *new_mountpoint(struct dentry *dentry)
mp->m_dentry = dentry;
mp->m_count = 1;
list_add(&mp->m_hash, chain);
+ INIT_LIST_HEAD(&mp->m_list);
return mp;
}

@@ -643,6 +645,7 @@ static void put_mountpoint(struct mountpoint *mp)
{
if (!--mp->m_count) {
struct dentry *dentry = mp->m_dentry;
+ BUG_ON(!list_empty(&mp->m_list));
spin_lock(&dentry->d_lock);
dentry->d_flags &= ~DCACHE_MOUNTED;
spin_unlock(&dentry->d_lock);
@@ -689,6 +692,7 @@ static void detach_mnt(struct mount *mnt, struct path *old_path)
mnt->mnt_mountpoint = mnt->mnt.mnt_root;
list_del_init(&mnt->mnt_child);
list_del_init(&mnt->mnt_hash);
+ list_del_init(&mnt->mnt_mp_list);
put_mountpoint(mnt->mnt_mp);
mnt->mnt_mp = NULL;
}
@@ -705,6 +709,7 @@ void mnt_set_mountpoint(struct mount *mnt,
child_mnt->mnt_mountpoint = dget(mp->m_dentry);
child_mnt->mnt_parent = mnt;
child_mnt->mnt_mp = mp;
+ list_add_tail(&child_mnt->mnt_mp_list, &mp->m_list);
}

/*
@@ -1191,6 +1196,7 @@ void umount_tree(struct mount *mnt, int propagate)
list_del_init(&p->mnt_child);
if (mnt_has_parent(p)) {
p->mnt_parent->mnt_ghosts++;
+ list_del_init(&p->mnt_mp_list);
put_mountpoint(p->mnt_mp);
p->mnt_mp = NULL;
}
--
1.7.5.4

Eric W. Biederman

unread,
Oct 15, 2013, 4:20:04 PM10/15/13
to

In preparation for allowing mountpoints to be renamed and unlinked
in remote filesystems and in other mount namespaces test if on a dentry
there is a mount in the local mount namespace before allowing it to
be renamed or unlinked.

The primary motivation here are old versions of fusermount unmount
which is not safe if the a path can be renamed or unlinked while it is
verifying the mount is safe to unmount. More recent versions are simpler
and safer by simply using UMOUNT_NOFOLLOW when unmounting a mount
in a directory owned by an arbitrary user.

Miklos Szeredi <mik...@szeredi.hu> reports this is approach is good
enough to remove concerns about new kernels mixed with old versions
of fusermount.

A secondary motivation for restrictions here is that it removing empty
directories that have non-empty mount points on them appears to
violate the rule that rmdir can not remove empty directories. As
Linus Torvalds pointed out this is useful for programs (like git) that
test if a directory is empty with rmdir.

Therefore this patch arranges to enforce the existing mount point
semantics for local mount namespace.

Signed-off-by: "Eric W. Biederman" <ebie...@xmission.com>
---
fs/namei.c | 25 +++++++++++++++++++++++++
1 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 645268f23eb6..df7bd6e9c7b6 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3547,6 +3547,20 @@ void dentry_unhash(struct dentry *dentry)
spin_unlock(&dentry->d_lock);
}

+static bool covered(struct vfsmount *mnt, struct dentry *dentry)
+{
+ /* test to see if a dentry is covered with a mount in
+ * the current mount namespace.
+ */
+ bool is_covered;
+
+ rcu_read_lock();
+ is_covered = d_mountpoint(dentry) && __lookup_mnt(mnt, dentry, 1);
+ rcu_read_unlock();
+
+ return is_covered;
+}
+
int vfs_rmdir(struct inode *dir, struct dentry *dentry)
{
int error = may_delete(dir, dentry, 1);
@@ -3622,6 +3636,9 @@ retry:
error = -ENOENT;
goto exit3;
}
+ error = -EBUSY;
+ if (covered(nd.path.mnt, dentry))
+ goto exit3;
error = security_path_rmdir(&nd.path, dentry);
if (error)
goto exit3;
@@ -3716,6 +3733,9 @@ retry:
inode = dentry->d_inode;
if (!inode)
goto slashes;
+ error = -EBUSY;
+ if (covered(nd.path.mnt, dentry))
+ goto exit2;
ihold(inode);
error = security_path_unlink(&nd.path, dentry);
if (error)
@@ -4164,6 +4184,11 @@ retry:
error = -ENOTEMPTY;
if (new_dentry == trap)
goto exit5;
+ error = -EBUSY;
+ if (covered(oldnd.path.mnt, old_dentry))
+ goto exit5;
+ if (covered(newnd.path.mnt, new_dentry))
+ goto exit5;

error = security_path_rename(&oldnd.path, old_dentry,
&newnd.path, new_dentry);
--
1.7.5.4

Serge E. Hallyn

unread,
Oct 22, 2013, 3:10:03 PM10/22/13
to
Quoting Eric W. Biederman (ebie...@xmission.com):
>
> To spot any possible problems call BUG if a mountpoint
> is put when it's list of mounts is not empty.
>
> Signed-off-by: Eric W. Biederman <ebied...@twitter.com>

Acked-by: Serge Hallyn <serge....@canonical.com>

Serge E. Hallyn

unread,
Oct 22, 2013, 3:10:02 PM10/22/13
to
Quoting Eric W. Biederman (ebie...@xmission.com):
>
> In preparation for allowing mountpoints to be renamed and unlinked
> in remote filesystems and in other mount namespaces test if on a dentry
> there is a mount in the local mount namespace before allowing it to
> be renamed or unlinked.
>
> The primary motivation here are old versions of fusermount unmount
> which is not safe if the a path can be renamed or unlinked while it is
> verifying the mount is safe to unmount. More recent versions are simpler
> and safer by simply using UMOUNT_NOFOLLOW when unmounting a mount
> in a directory owned by an arbitrary user.
>
> Miklos Szeredi <mik...@szeredi.hu> reports this is approach is good
> enough to remove concerns about new kernels mixed with old versions
> of fusermount.
>
> A secondary motivation for restrictions here is that it removing empty
> directories that have non-empty mount points on them appears to
> violate the rule that rmdir can not remove empty directories. As
> Linus Torvalds pointed out this is useful for programs (like git) that
> test if a directory is empty with rmdir.
>
> Therefore this patch arranges to enforce the existing mount point
> semantics for local mount namespace.
>
> Signed-off-by: "Eric W. Biederman" <ebie...@xmission.com>

Acked-by: Serge Hallyn <serge....@canonical.com>

Serge E. Hallyn

unread,
Oct 22, 2013, 3:10:02 PM10/22/13
to
Quoting Eric W. Biederman (ebie...@xmission.com):
>
> v2: Always drop the lock when exiting early.
> v3: Make detach_mounts robust about freeing several
> mounts on the same mountpoint at one time, and remove
> the unneeded mnt_list list test.
>
> Signed-off-by: Eric W. Biederman <ebied...@twitter.com>

Acked-by: Serge Hallyn <serge....@canonical.com>

Serge E. Hallyn

unread,
Oct 22, 2013, 3:20:02 PM10/22/13
to
Quoting Eric W. Biederman (ebie...@xmission.com):
>
Acked-by: Serge Hallyn <serge....@canonical.com>

Al Viro

unread,
Nov 3, 2013, 12:00:01 AM11/3/13
to
On Tue, Oct 15, 2013 at 01:16:48PM -0700, Eric W. Biederman wrote:

> int vfs_rmdir(struct inode *dir, struct dentry *dentry)
> {
> int error = may_delete(dir, dentry, 1);
> @@ -3622,6 +3636,9 @@ retry:
> error = -ENOENT;
> goto exit3;
> }
> + error = -EBUSY;
> + if (covered(nd.path.mnt, dentry))
> + goto exit3;

Ugh... And it's not racy because of...? IOW, what's to keep the return
value of covered() from getting obsolete just as it's being calculated,
let alone returned?

Eric W. Biederman

unread,
Nov 8, 2013, 4:00:02 PM11/8/13
to
Al Viro <vi...@ZenIV.linux.org.uk> writes:

> On Tue, Oct 15, 2013 at 01:16:48PM -0700, Eric W. Biederman wrote:
>
>> int vfs_rmdir(struct inode *dir, struct dentry *dentry)
>> {
>> int error = may_delete(dir, dentry, 1);
>> @@ -3622,6 +3636,9 @@ retry:
>> error = -ENOENT;
>> goto exit3;
>> }
>> + error = -EBUSY;
>> + if (covered(nd.path.mnt, dentry))
>> + goto exit3;
>
> Ugh... And it's not racy because of...? IOW, what's to keep the return
> value of covered() from getting obsolete just as it's being calculated,
> let alone returned?

The return value of d_mountpoint can be obsolete as soon as it returns
as well, so I don't see this as being significantly different.

I would like to say that any changes introduced here do not matter
because all of this is just to keep a semblance of the old semantics.
Unfortunately for me part of keeping that semblance is as much as is
reasonable preserving the existing race guarantees.

In 3.12 we create a mount with:
- The dentry->d_inode mutex held.
- The namespace_sem held.

In 3.12 we remove a mount with just the namespace_sem held.

I call covered in: do_rmdir, do_unlinkat, and renameat.

In 3.12 vfs_rmdir checks d_mountpoint with the
dentry->d_inode->i_mutex and
dentry->d_parent->d_inode->i_mutex held.

In 3.12 vfs_unlink checks d_mountpoint with the
dentry->d_inode->i_mutex and
dentry->d_parent->d_inode->i_mutex hel.d

In 3.12 vfs_rename_dir and vfs_rename_other checks d_mountpint with the
target->i_mutex, new_dir->i_mutex, and old_dir->i_mutex held.


Therefore the guarantees in 3.12 are:
- unlink versus mount races are prevented by the
dentry->d_inode->i_mutex of the dentry being removed.
- unlink versus umount races are uninteresting.
- mount versus rename races in testing of d_mountpoint are ignored.
- umount versus rename races in testing of d_mountpoint are ignored.

So comparing this to how I have implemented covered the test is at a
slightly different location in the call path so there may be a slightly
larger race in rename.

For unlink there is a race where the mount could happen after testing
covered. Then the unlink happens. Then we remove the mount with
detach_mounts.


In the context of the symlink attacks against umounting of fuse I don't
see a difference.

In the only case where there is a new race (unlink versus mount) I see
a narrow window where new behavior will happen the unlink will win and
we unmount the filesystem. So there is a vary narrow window in which
we might have a stale entry in /etc/mtab.

So after all of that analysis I don't think we care. If we do care with
a little more work we can pass the mountpoint down and test covered with
dentry->d_inode->i_mutex held, where we test d_mountpoint in 3.12 today.

Eric

Al Viro

unread,
Nov 8, 2013, 4:40:03 PM11/8/13
to
On Fri, Nov 08, 2013 at 12:51:52PM -0800, Eric W. Biederman wrote:

> The return value of d_mountpoint can be obsolete as soon as it returns
> as well, so I don't see this as being significantly different.

Not if the ->i_mutex of that sucker is held. And it *is* held in
vfs_unlink/vfs_rmdir/vfs_rename. Note that we only care about a mountpoint
being falsely assumed to be a non-mountpoint - in the other direction we
can just shrug and say that we'd won the race and got EBUSY for that.

> In 3.12 vfs_rmdir checks d_mountpoint with the
> dentry->d_inode->i_mutex and
> dentry->d_parent->d_inode->i_mutex held.
>
> In 3.12 vfs_unlink checks d_mountpoint with the
> dentry->d_inode->i_mutex and
> dentry->d_parent->d_inode->i_mutex hel.d
>
> In 3.12 vfs_rename_dir and vfs_rename_other checks d_mountpint with the
> target->i_mutex, new_dir->i_mutex, and old_dir->i_mutex held.
>
>
> Therefore the guarantees in 3.12 are:
> - unlink versus mount races are prevented by the
> dentry->d_inode->i_mutex of the dentry being removed.
> - unlink versus umount races are uninteresting.
> - mount versus rename races in testing of d_mountpoint are ignored.

Read what you've written a few lines above. The part about target->i_mutex
being held.

> So comparing this to how I have implemented covered the test is at a
> slightly different location in the call path so there may be a slightly
> larger race in rename.

You've got a race in unlink. You've got a race in rename. You've got a race
in rmdir. And none of those had that race in 3.12 (including rename()).

BTW, could you describe the races with umount in a bit more details? Races
with mount are simple - rmdir() sees that victim isn't a mountpoint and
proceeds, mount() sees that victim is still alive and proceeds, despite
the fact that victim is irretrievably on the way to removal. And that's
what ->i_mutex on victim prevents, making "check for d_mountpoint / remove /
call dont_mount()" atomic wrt mount(). What is the problem you are seeing
with umount()? rmdir() getting EBUSY because it hasn't noticed umount()
happening in parallel with it? Legitimate behaviour, as far I can see...
Or is it about something different?

Eric W. Biederman

unread,
Nov 8, 2013, 5:20:04 PM11/8/13
to
Al Viro <vi...@ZenIV.linux.org.uk> writes:

> On Fri, Nov 08, 2013 at 12:51:52PM -0800, Eric W. Biederman wrote:
>
>> The return value of d_mountpoint can be obsolete as soon as it returns
>> as well, so I don't see this as being significantly different.
>
> Not if the ->i_mutex of that sucker is held. And it *is* held in
> vfs_unlink/vfs_rmdir/vfs_rename. Note that we only care about a mountpoint
> being falsely assumed to be a non-mountpoint - in the other direction we
> can just shrug and say that we'd won the race and got EBUSY for that.

I wasn't certain of your question. My point here was that covered() as
a mechanism is as good as d_mountpoint. So the only potential issue
with covered() as a mechanism is where covered() is called.

Also please note old_dentry->d_inode->i_mutex is not held in rename.

>> In 3.12 vfs_rmdir checks d_mountpoint with the
>> dentry->d_inode->i_mutex and
>> dentry->d_parent->d_inode->i_mutex held.
>>
>> In 3.12 vfs_unlink checks d_mountpoint with the
>> dentry->d_inode->i_mutex and
>> dentry->d_parent->d_inode->i_mutex hel.d
>>
>> In 3.12 vfs_rename_dir and vfs_rename_other checks d_mountpint with the
>> target->i_mutex, new_dir->i_mutex, and old_dir->i_mutex held.
>>
>>
>> Therefore the guarantees in 3.12 are:
>> - unlink versus mount races are prevented by the
>> dentry->d_inode->i_mutex of the dentry being removed.
>> - unlink versus umount races are uninteresting.
>> - mount versus rename races in testing of d_mountpoint are ignored.
>
> Read what you've written a few lines above. The part about target->i_mutex
> being held.

That works for the rename as unlink case but we don't hold
old_dentry->d_inode->i_mutex which is what is needed to prevent a mount
on the dentry we are renaming.

>> So comparing this to how I have implemented covered the test is at a
>> slightly different location in the call path so there may be a slightly
>> larger race in rename.
>
> You've got a race in unlink. You've got a race in rename. You've got a race
> in rmdir. And none of those had that race in 3.12 (including rename()).

Rename absolutely has a race in 3.12. With very lucky timing it is
possible to mount something on directory a, simultaneously rename
a to b, and have the mount show up on b.

> BTW, could you describe the races with umount in a bit more details? Races
> with mount are simple - rmdir() sees that victim isn't a mountpoint and
> proceeds, mount() sees that victim is still alive and proceeds, despite
> the fact that victim is irretrievably on the way to removal. And that's
> what ->i_mutex on victim prevents, making "check for d_mountpoint / remove /
> call dont_mount()" atomic wrt mount(). What is the problem you are seeing
> with umount()? rmdir() getting EBUSY because it hasn't noticed umount()
> happening in parallel with it? Legitimate behaviour, as far I can see...
> Or is it about something different?

I did not say it was a problem only that it was a race. The only case I
can see is getting a state EBUSY, and I see no problem with a that.

Eric

Christoph Hellwig

unread,
Nov 9, 2013, 4:00:02 AM11/9/13
to
On Fri, Nov 08, 2013 at 02:17:31PM -0800, Eric W. Biederman wrote:
> > Read what you've written a few lines above. The part about target->i_mutex
> > being held.
>
> That works for the rename as unlink case but we don't hold
> old_dentry->d_inode->i_mutex which is what is needed to prevent a mount
> on the dentry we are renaming.

It will be held in 3.13.

Eric W. Biederman

unread,
Nov 21, 2013, 4:00:01 PM11/21/13
to
Christoph Hellwig <h...@infradead.org> writes:

> On Fri, Nov 08, 2013 at 02:17:31PM -0800, Eric W. Biederman wrote:
>> > Read what you've written a few lines above. The part about target->i_mutex
>> > being held.
>>
>> That works for the rename as unlink case but we don't hold
>> old_dentry->d_inode->i_mutex which is what is needed to prevent a mount
>> on the dentry we are renaming.
>
> It will be held in 3.13.

Only for files, not for directories. And none of those locks turns out
to be good enough today to prevent the races between mount and rename.
With the result that when mount returns your mount point could be
located just about anywhere, and that is just considering renames of the
actual mountpoint itself.

Eric

Eric W. Biederman

unread,
Nov 21, 2013, 4:00:02 PM11/21/13
to
Al Viro <vi...@ZenIV.linux.org.uk> writes:

> On Tue, Oct 15, 2013 at 01:16:48PM -0700, Eric W. Biederman wrote:
>
>> int vfs_rmdir(struct inode *dir, struct dentry *dentry)
>> {
>> int error = may_delete(dir, dentry, 1);
>> @@ -3622,6 +3636,9 @@ retry:
>> error = -ENOENT;
>> goto exit3;
>> }
>> + error = -EBUSY;
>> + if (covered(nd.path.mnt, dentry))
>> + goto exit3;
>
> Ugh... And it's not racy because of...? IOW, what's to keep the return
> value of covered() from getting obsolete just as it's being calculated,
> let alone returned?

I have been fighting a cold off and on so I have been taking much longer
to dig through all of these issues than I would like.

Aftering having thought through all of the issues I completely agree
that this is a racy bug that needs to be fixed.

The fix needs to be holding i_mutex of the parent directory in do_mount,
and pivot_root.

We need to hold i_mutex in do_mount and pivot_root not because of this
issue but to prevent mount points being renamed before we mount on them.

With todays kernel because of races between when we lookup a mount point
and when we take locks, and which locks we take. When mount(2) returns
the mount point can be located anywhere. Which completely defeats
returning -EBUSY mount points from a userspace semantics perspective.

I was really hoping I could think through this and say that this was a
trivial issue that would allow my patches to be good for 3.13. But that
is clearly not the case. kern_path_locked(...,LOOKUP_FOLLOW,...) is
non-trivial to implement, and there are issues like having to move
get_fs_type before we take any locks to prevent deadlocks.

I almost have the issues worked through, so hopefully I can send a
rebased set of patches in a few days.

Eric

Eric W. Biederman

unread,
Feb 15, 2014, 4:40:01 PM2/15/14
to

The current comments in d_invalidate about what and why it is doing
what it is doing are wildly off-base. Which is not surprising as
the comments date back to last minute bug fix of the 2.2 kernel.

The big fat lie of a comment said: If it's a directory, we can't drop
it for fear of somebody re-populating it with children (even though
dropping it would make it unreachable from that root, we still might
repopulate it if it was a working directory or similar).

The truth is that for remote filesystems the failure of d_revalidate
and d_weak_revalidate prevents us from populating or otherwise
inappropriately using a directory. For local filesystems and for
local directory removals the setting of S_DEAD prevents us from
populating or otherwise inappropriate using a directory.

The current rules are:
- To prevent mount point leaks dentries that are mount points or that
have childrent that are mount points may not be be unhashed.
- All dentries may be unhashed.
- Directories may be rehashed with d_materialise_unique

check_submounts_and_drop implements this already for well maintained
remote filesystems so implement the current rules in d_invalidate
by just calling check_submounts_and_drop.

The one difference between d_invalidate and check_submounts_and_drop
is that d_invalidate must respect it when a d_revalidate method has
earlier called d_drop so preserve the d_unhashed check in
d_invalidate.

Signed-off-by: "Eric W. Biederman" <ebie...@xmission.com>
---
fs/dcache.c | 38 ++++----------------------------------
1 files changed, 4 insertions(+), 34 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index c71c86f7780e..b0add629f5fe 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -613,9 +613,8 @@ EXPORT_SYMBOL(dput);
* @dentry: dentry to invalidate
*
* Try to invalidate the dentry if it turns out to be
- * possible. If there are other dentries that can be
- * reached through this one we can't delete it and we
- * return -EBUSY. On success we return 0.
+ * possible. If there are reasons not to delete it
+ * return -EBUSY. On success return 0.
*
* no dcache lock.
*/
@@ -630,38 +629,9 @@ int d_invalidate(struct dentry * dentry)
spin_unlock(&dentry->d_lock);
return 0;
}
- /*
- * Check whether to do a partial shrink_dcache
- * to get rid of unused child entries.
- */
- if (!list_empty(&dentry->d_subdirs)) {
- spin_unlock(&dentry->d_lock);
- shrink_dcache_parent(dentry);
- spin_lock(&dentry->d_lock);
- }
-
- /*
- * Somebody else still using it?
- *
- * If it's a directory, we can't drop it
- * for fear of somebody re-populating it
- * with children (even though dropping it
- * would make it unreachable from the root,
- * we might still populate it if it was a
- * working directory or similar).
- * We also need to leave mountpoints alone,
- * directory or not.
- */
- if (dentry->d_lockref.count > 1 && dentry->d_inode) {
- if (S_ISDIR(dentry->d_inode->i_mode) || d_mountpoint(dentry)) {
- spin_unlock(&dentry->d_lock);
- return -EBUSY;
- }
- }
-
- __d_drop(dentry);
spin_unlock(&dentry->d_lock);
- return 0;
+
+ return check_submounts_and_drop(dentry);
}
EXPORT_SYMBOL(d_invalidate);

--
1.7.5.4

Eric W. Biederman

unread,
Feb 15, 2014, 4:40:02 PM2/15/14
to

v2: Always drop the lock when exiting early.
v3: Make detach_mounts robust about freeing several
mounts on the same mountpoint at one time, and remove
the unneeded mnt_list list test.
v4: Document the purpose of detach_mounts and why new_mountpoint is
safe to call.

Signed-off-by: Eric W. Biederman <ebied...@twitter.com>
---
fs/mount.h | 2 ++
fs/namespace.c | 39 +++++++++++++++++++++++++++++++++++++++
2 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index 50a72d46e7a6..2b470f34e665 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -84,6 +84,8 @@ extern struct mount *__lookup_mnt_last(struct vfsmount *, struct dentry *);

extern bool legitimize_mnt(struct vfsmount *, unsigned);

+extern void detach_mounts(struct dentry *dentry);
+
static inline void get_mnt_ns(struct mnt_namespace *ns)
{
atomic_inc(&ns->count);
diff --git a/fs/namespace.c b/fs/namespace.c
index 33db9e95bd5c..7abbf722ce18 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1359,6 +1359,45 @@ static int do_umount(struct mount *mnt, int flags)
return retval;
}

+/*
+ * detach_mounts - lazily unmount all mounts on the specified dentry
+ *
+ * During unlink, rmdir, and d_drop it is possible to loose the path
+ * to an existing mountpoint, and wind up leaking the mount.
+ * detach_mounts allows lazily unmounting those mounts instead of
+ * leaking them.
+ *
+ * The caller may hold dentry->d_inode->i_mutex.
+ */
+void detach_mounts(struct dentry *dentry)
+{
+ struct mountpoint *mp;
+ struct mount *mnt;
+
+ namespace_lock();
+ if (!d_mountpoint(dentry))
+ goto out_unlock;
+
+ /*
+ * The namespace lock and d_mountpoint being set guarantees
+ * that new_mountpoint will just be a lookup of the existing
+ * mountpoint structure.
+ */
+ mp = new_mountpoint(dentry);
+ if (IS_ERR(mp))
+ goto out_unlock;
+
+ lock_mount_hash();
+ while (!list_empty(&mp->m_list)) {
+ mnt = list_first_entry(&mp->m_list, struct mount, mnt_mp_list);
+ umount_tree(mnt, 1);
+ }
+ unlock_mount_hash();
+ put_mountpoint(mp);
+out_unlock:
+ namespace_unlock();
+}
+
/*
* Is the caller allowed to modify his namespace?
*/

Eric W. Biederman

unread,
Feb 15, 2014, 4:40:02 PM2/15/14
to

Now that d_invalidate is the only caller of check_submounts_and_drop,
expand check_submounts_and_drop inline in d_invalidate.

Signed-off-by: "Eric W. Biederman" <ebie...@xmission.com>
---
fs/dcache.c | 55 +++++++++++++++++++----------------------------
include/linux/dcache.h | 1 -
2 files changed, 22 insertions(+), 34 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 27585b1dd6f1..5b41205cbf33 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -608,32 +608,6 @@ kill_it:
}
EXPORT_SYMBOL(dput);

-/**
- * d_invalidate - invalidate a dentry
- * @dentry: dentry to invalidate
- *
- * Try to invalidate the dentry if it turns out to be
- * possible. If there are reasons not to delete it
- * return -EBUSY. On success return 0.
- *
- * no dcache lock.
- */
-
-int d_invalidate(struct dentry * dentry)
-{
- /*
- * If it's already been dropped, return OK.
- */
- spin_lock(&dentry->d_lock);
- if (d_unhashed(dentry)) {
- spin_unlock(&dentry->d_lock);
- return 0;
- }
- spin_unlock(&dentry->d_lock);
-
- return check_submounts_and_drop(dentry);
-}
-EXPORT_SYMBOL(d_invalidate);

/* This must be called with d_lock held */
static inline void __dget_dlock(struct dentry *dentry)
@@ -1175,7 +1149,7 @@ EXPORT_SYMBOL(have_submounts);
* reachable (e.g. NFS can unhash a directory dentry and then the complete
* subtree can become unreachable).
*
- * Only one of check_submounts_and_drop() and d_set_mounted() must succeed. For
+ * Only one of d_invalidate() and d_set_mounted() must succeed. For
* this reason take rename_lock and d_lock on dentry and ancestors.
*/
int d_set_mounted(struct dentry *dentry)
@@ -1184,7 +1158,7 @@ int d_set_mounted(struct dentry *dentry)
int ret = -ENOENT;
write_seqlock(&rename_lock);
for (p = dentry->d_parent; !IS_ROOT(p); p = p->d_parent) {
- /* Need exclusion wrt. check_submounts_and_drop() */
+ /* Need exclusion wrt. d_invalidate() */
spin_lock(&p->d_lock);
if (unlikely(d_unhashed(p))) {
spin_unlock(&p->d_lock);
@@ -1400,18 +1374,33 @@ static void check_and_drop(void *_data)
}

/**
- * check_submounts_and_drop - detach submounts, prune dcache, and drop
+ * d_invalidate - detach submounts, prune dcache, and drop
+ * @dentry: dentry to invalidate (aka detach, prune and drop)
+ *
+ * Try to invalidate the dentry if it turns out to be
+ * possible. If there are reasons not to delete it
+ * return -EBUSY. On success return 0.
+ *
+ * no dcache lock.
*
* The final d_drop is done as an atomic operation relative to
* rename_lock ensuring there are no races with d_set_mounted. This
* ensures there are no unhashed dentries on the path to a mountpoint.
- *
- * @dentry: dentry to detach, prune and drop
*/
-int check_submounts_and_drop(struct dentry *dentry)
+int d_invalidate(struct dentry *dentry)
{
int ret = 0;

+ /*
+ * If it's already been dropped, return OK.
+ */
+ spin_lock(&dentry->d_lock);
+ if (d_unhashed(dentry)) {
+ spin_unlock(&dentry->d_lock);
+ return 0;
+ }
+ spin_unlock(&dentry->d_lock);
+
/* Negative dentries can be dropped without further checks */
if (!dentry->d_inode) {
d_drop(dentry);
@@ -1445,7 +1434,7 @@ int check_submounts_and_drop(struct dentry *dentry)
out:
return ret;
}
-EXPORT_SYMBOL(check_submounts_and_drop);
+EXPORT_SYMBOL(d_invalidate);

/**
* __d_alloc - allocate a dcache entry
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index bf72e9ac6de0..ae77222c3e86 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -265,7 +265,6 @@ extern void d_prune_aliases(struct inode *);

/* test whether we have any submounts in a subdir tree */
extern int have_submounts(struct dentry *);
-extern int check_submounts_and_drop(struct dentry *);

/*
* This adds the entry to the hash queues.

Eric W. Biederman

unread,
Feb 15, 2014, 4:40:03 PM2/15/14
to

Now that check_submounts_and_drop can not fail and is called from
d_invalidate there is no longer a need to call check_submounts_and_drom
from filesystem d_revalidate methods so remove it.

Signed-off-by: "Eric W. Biederman" <ebie...@xmission.com>
---
fs/afs/dir.c | 5 -----
fs/fuse/dir.c | 3 ---
fs/gfs2/dentry.c | 3 ---
fs/kernfs/dir.c | 11 -----------
fs/nfs/dir.c | 4 ----
5 files changed, 0 insertions(+), 26 deletions(-)

diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index 529300327f45..a1645b88fe8a 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -669,7 +669,6 @@ static int afs_d_revalidate(struct dentry *dentry, unsigned int flags)

out_valid:
dentry->d_fsdata = dir_version;
-out_skip:
dput(parent);
key_put(key);
_leave(" = 1 [valid]");
@@ -682,10 +681,6 @@ not_found:
spin_unlock(&dentry->d_lock);

out_bad:
- /* don't unhash if we have submounts */
- if (check_submounts_and_drop(dentry) != 0)
- goto out_skip;
-
_debug("dropping dentry %s/%s",
parent->d_name.name, dentry->d_name.name);
dput(parent);
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 1d1292c581c3..5192d0a04d20 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -273,9 +273,6 @@ out:

invalid:
ret = 0;
-
- if (!(flags & LOOKUP_RCU) && check_submounts_and_drop(entry) != 0)
- ret = 1;
goto out;
}

diff --git a/fs/gfs2/dentry.c b/fs/gfs2/dentry.c
index d3a5d4e29ba5..589f4ea9381c 100644
--- a/fs/gfs2/dentry.c
+++ b/fs/gfs2/dentry.c
@@ -93,9 +93,6 @@ invalid_gunlock:
if (!had_lock)
gfs2_glock_dq_uninit(&d_gh);
invalid:
- if (check_submounts_and_drop(dentry) != 0)
- goto valid;
-
dput(parent);
return 0;

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 5104cf5d25c5..43d5ab905238 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -296,21 +296,10 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
goto out_bad;

mutex_unlock(&kernfs_mutex);
-out_valid:
return 1;
out_bad:
mutex_unlock(&kernfs_mutex);
out_bad_unlocked:
- /*
- * @dentry doesn't match the underlying kernfs node, drop the
- * dentry and force lookup. If we have submounts we must allow the
- * vfs caches to lie about the state of the filesystem to prevent
- * leaks and other nasty things, so use check_submounts_and_drop()
- * instead of d_drop().
- */
- if (check_submounts_and_drop(dentry) != 0)
- goto out_valid;
-
return 0;
}

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index be38b573495a..a8a2f7f86c45 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1143,10 +1143,6 @@ out_zap_parent:
if (IS_ROOT(dentry))
goto out_valid;
}
- /* If we have submounts, don't unhash ! */
- if (check_submounts_and_drop(dentry) != 0)
- goto out_valid;
-
dput(parent);
dfprintk(LOOKUPCACHE, "NFS: %s(%pd2) is invalid\n",
__func__, dentry);

Eric W. Biederman

unread,
Feb 15, 2014, 4:40:03 PM2/15/14
to

In preparation for allowing mountpoints to be renamed and unlinked
in remote filesystems and in other mount namespaces test if on a dentry
there is a mount in the local mount namespace before allowing it to
be renamed or unlinked.

The primary motivation here are old versions of fusermount unmount
which is not safe if the a path can be renamed or unlinked while it is
verifying the mount is safe to unmount. More recent versions are simpler
and safer by simply using UMOUNT_NOFOLLOW when unmounting a mount
in a directory owned by an arbitrary user.

Miklos Szeredi <mik...@szeredi.hu> reports this is approach is good
enough to remove concerns about new kernels mixed with old versions
of fusermount.

A secondary motivation for restrictions here is that it removing empty
directories that have non-empty mount points on them appears to
violate the rule that rmdir can not remove empty directories. As
Linus Torvalds pointed out this is useful for programs (like git) that
test if a directory is empty with rmdir.

Therefore this patch arranges to enforce the existing mount point
semantics for local mount namespace.

v2: Rewrote the test to be a drop in replacement for d_mountpoint

Signed-off-by: "Eric W. Biederman" <ebie...@xmission.com>
---
fs/mount.h | 9 +++++++++
fs/namei.c | 8 +++++++-
fs/namespace.c | 35 +++++++++++++++++++++++++++++++++++
3 files changed, 51 insertions(+), 1 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index a17458ca6f29..17d913d86add 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -109,3 +109,12 @@ struct proc_mounts {
#define proc_mounts(p) (container_of((p), struct proc_mounts, m))

extern const struct seq_operations mounts_op;
+
+extern int __is_local_mountpoint(struct dentry *dentry);
+static inline int is_local_mountpoint(struct dentry *dentry)
+{
+ if (!d_mountpoint(dentry))
+ return 0;
+
+ return __is_local_mountpoint(dentry);
+}
diff --git a/fs/namei.c b/fs/namei.c
index d580df2e6804..4e6fe16ef488 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3507,6 +3507,8 @@ int vfs_rmdir(struct inode *dir, struct dentry *dentry)
mutex_lock(&dentry->d_inode->i_mutex);

error = -EBUSY;
+ if (is_local_mountpoint(dentry))
+ goto out;
if (d_mountpoint(dentry))
goto out;

@@ -3622,7 +3624,7 @@ int vfs_unlink(struct inode *dir, struct dentry *dentry, struct inode **delegate
return -EPERM;

mutex_lock(&target->i_mutex);
- if (d_mountpoint(dentry))
+ if (is_local_mountpoint(dentry) || d_mountpoint(dentry))
error = -EBUSY;
else {
error = security_inode_unlink(dir, dentry);
@@ -4001,6 +4003,8 @@ static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
mutex_lock(&target->i_mutex);

error = -EBUSY;
+ if (is_local_mountpoint(old_dentry) || is_local_mountpoint(new_dentry))
+ goto out;
if (d_mountpoint(old_dentry) || d_mountpoint(new_dentry))
goto out;

@@ -4045,6 +4049,8 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
lock_two_nondirectories(source, target);

error = -EBUSY;
+ if (is_local_mountpoint(old_dentry) || is_local_mountpoint(new_dentry))
+ goto out;
if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
goto out;

diff --git a/fs/namespace.c b/fs/namespace.c
index 22e536705c45..6dc23614d44d 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -631,6 +631,41 @@ struct vfsmount *lookup_mnt(struct path *path)
return m;
}

+/*
+ * __is_local_mountpoint - Test to see if dentry is a mountpoint in the
+ * current mount namespace.
+ *
+ * The common case is dentries are not mountpoints at all and that
+ * test is handled inline. For the slow case when we are actually
+ * dealing with a mountpoint of some kind, walk through all of the
+ * mounts in the current mount namespace and test to see if the dentry
+ * is a mountpoint.
+ *
+ * The mount_hashtable is not usable in the context because we
+ * need to identify all mounts that may be in the current mount
+ * namespace not just a mount that happens to have some specified
+ * parent mount.
+ */
+int __is_local_mountpoint(struct dentry *dentry)
+{
+ struct mnt_namespace *ns = current->nsproxy->mnt_ns;
+ struct mount *mnt;
+ int is_covered = 0;
+
+ if (!d_mountpoint(dentry))
+ goto out;
+
+ down_read(&namespace_sem);
+ list_for_each_entry(mnt, &ns->list, mnt_list) {
+ is_covered = (mnt->mnt_mountpoint == dentry);
+ if (is_covered)
+ break;
+ }
+ up_read(&namespace_sem);
+out:
+ return is_covered;
+}
+
static struct mountpoint *new_mountpoint(struct dentry *dentry)
{
struct list_head *chain = mountpoint_hashtable + hash(NULL, dentry);

Eric W. Biederman

unread,
Feb 15, 2014, 4:40:03 PM2/15/14
to

The last round of this patchset the semantics of dropping mounts on
unlink were agreed upon so long as we preserve the current semantics in
a single mount namespace.

There are two big changes with this version of the patchset.
1) The test for a mount in the current mount namespace has been
rewritten so that it is actually correct and is now a drop in
replacement for d_mountpoint. Removing the need for code changes
that could cause unitended regressions.

2) This work has been comprehensivly extend to d_revalidate and
d_invalidate allowing some long standing bugs to be fixed at the end
of this patch series.

In net this should result in a kernel that is more comprehensible and
maintainble as well as fixing some possible mount leaks, and removing
the possibility of DOS attacks from evily placed mount points, in other
mount namespaces.

Eric W. Biederman (11):
vfs: Document the effect of d_revalidate on d_find_alias
vfs: More precise tests in d_invalidate
vfs: Don't allow overwriting mounts in the current mount namespace
vfs: Keep a list of mounts on a mount point
vfs: Add a function to lazily unmount all mounts from any dentry.
vfs: Lazily remove mounts on unlinked files and directories.
vfs: Remove unnecessary calls of check_submounts_and_drop
vfs: Merge check_submounts_and_drop and d_invalidate
vfs: Make d_invalidate return void
vfs: Remove d_drop calls from d_revalidate implementations
proc: Update proc_flush_task_mnt to use d_invalidate

fs/afs/dir.c | 5 --
fs/btrfs/ioctl.c | 5 +--
fs/ceph/dir.c | 1 -
fs/cifs/readdir.c | 6 +--
fs/dcache.c | 140 +++++++++++++++++-------------------------------
fs/fuse/dir.c | 7 +--
fs/gfs2/dentry.c | 3 -
fs/kernfs/dir.c | 11 ----
fs/mount.h | 13 +++++
fs/namei.c | 28 ++++++----
fs/namespace.c | 80 +++++++++++++++++++++++++++
fs/nfs/dir.c | 7 +--
fs/proc/base.c | 10 +---
fs/proc/fd.c | 2 -
include/linux/dcache.h | 3 +-
15 files changed, 167 insertions(+), 154 deletions(-)

Eric

Eric W. Biederman

unread,
Feb 15, 2014, 4:40:03 PM2/15/14
to

To spot any possible problems call BUG if a mountpoint
is put when it's list of mounts is not empty.

Signed-off-by: Eric W. Biederman <ebied...@twitter.com>
---
fs/mount.h | 2 ++
fs/namespace.c | 6 ++++++
2 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index 17d913d86add..50a72d46e7a6 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -21,6 +21,7 @@ struct mnt_pcp {
struct mountpoint {
struct list_head m_hash;
struct dentry *m_dentry;
+ struct list_head m_list;
int m_count;
};

@@ -48,6 +49,7 @@ struct mount {
struct mount *mnt_master; /* slave is on master->mnt_slave_list */
struct mnt_namespace *mnt_ns; /* containing namespace */
struct mountpoint *mnt_mp; /* where is it mounted */
+ struct list_head mnt_mp_list; /* list mounts with the same mountpoint */
#ifdef CONFIG_FSNOTIFY
struct hlist_head mnt_fsnotify_marks;
__u32 mnt_fsnotify_mask;
diff --git a/fs/namespace.c b/fs/namespace.c
index 6dc23614d44d..33db9e95bd5c 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -195,6 +195,7 @@ static struct mount *alloc_vfsmnt(const char *name)
INIT_LIST_HEAD(&mnt->mnt_share);
INIT_LIST_HEAD(&mnt->mnt_slave_list);
INIT_LIST_HEAD(&mnt->mnt_slave);
+ INIT_LIST_HEAD(&mnt->mnt_mp_list);
#ifdef CONFIG_FSNOTIFY
INIT_HLIST_HEAD(&mnt->mnt_fsnotify_marks);
#endif
@@ -695,6 +696,7 @@ static struct mountpoint *new_mountpoint(struct dentry *dentry)
mp->m_dentry = dentry;
mp->m_count = 1;
list_add(&mp->m_hash, chain);
+ INIT_LIST_HEAD(&mp->m_list);
return mp;
}

@@ -702,6 +704,7 @@ static void put_mountpoint(struct mountpoint *mp)
{
if (!--mp->m_count) {
struct dentry *dentry = mp->m_dentry;
+ BUG_ON(!list_empty(&mp->m_list));
spin_lock(&dentry->d_lock);
dentry->d_flags &= ~DCACHE_MOUNTED;
spin_unlock(&dentry->d_lock);
@@ -748,6 +751,7 @@ static void detach_mnt(struct mount *mnt, struct path *old_path)
mnt->mnt_mountpoint = mnt->mnt.mnt_root;
list_del_init(&mnt->mnt_child);
list_del_init(&mnt->mnt_hash);
+ list_del_init(&mnt->mnt_mp_list);
put_mountpoint(mnt->mnt_mp);
mnt->mnt_mp = NULL;
}
@@ -764,6 +768,7 @@ void mnt_set_mountpoint(struct mount *mnt,
child_mnt->mnt_mountpoint = dget(mp->m_dentry);
child_mnt->mnt_parent = mnt;
child_mnt->mnt_mp = mp;
+ list_add_tail(&child_mnt->mnt_mp_list, &mp->m_list);
}

/*
@@ -1246,6 +1251,7 @@ void umount_tree(struct mount *mnt, int how)
p->mnt.mnt_flags |= MNT_SYNC_UMOUNT;
list_del_init(&p->mnt_child);
if (mnt_has_parent(p)) {
+ list_del_init(&p->mnt_mp_list);
put_mountpoint(p->mnt_mp);
/* move the reference to mountpoint into ->mnt_ex_mountpoint */
p->mnt_ex_mountpoint.dentry = p->mnt_mountpoint;
--
1.7.5.4

Eric W. Biederman

unread,
Feb 15, 2014, 4:40:04 PM2/15/14
to

d_drop or check_submounts_and_drop called from d_revalidate can result
in renamed directories with child dentries being unhashed. These
renamed and drop directory dentries can be rehashed after
d_materialise_unique uses d_find_alias to find them.

Signed-off-by: "Eric W. Biederman" <ebie...@xmission.com>
---
fs/dcache.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 265e0ce9769c..c71c86f7780e 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -726,7 +726,8 @@ EXPORT_SYMBOL(dget_parent);
* acquire the reference to alias and return it. Otherwise return NULL.
* Notice that if inode is a directory there can be only one alias and
* it can be unhashed only if it has no children, or if it is the root
- * of a filesystem.
+ * of a filesystem, or if the directory was renamed and d_revalidate
+ * was the first vfs operation to notice.
*
* If the inode has an IS_ROOT, DCACHE_DISCONNECTED alias, then prefer
* any other hashed alias over that one unless @want_discon is set,
--
1.7.5.4

Eric W. Biederman

unread,
Feb 15, 2014, 4:40:04 PM2/15/14
to
v3: Optimized shrink_submounts_and_drop
Removed unsued afs label
v4: Simplified the changes to check_submounts_and_drop
Do not rename check_submounts_and_drop shrink_submounts_and_drop
Document what why we need atomicity in check_submounts_and_drop
Rely on the parent inode mutex to make d_revalidate and d_invalidate
an atomic unit.
v5: Refcount the mountpoint to detach in case of simultaneous
renames.

Signed-off-by: "Eric W. Biederman" <ebie...@xmission.com>
---
fs/dcache.c | 60 ++++++++++++++++++++++++++++++++--------------------------
fs/namei.c | 18 ++++++++--------
2 files changed, 42 insertions(+), 36 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index b0add629f5fe..27585b1dd6f1 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1374,36 +1374,39 @@ void shrink_dcache_for_umount(struct super_block *sb)
}
}

-static enum d_walk_ret check_and_collect(void *_data, struct dentry *dentry)
+struct detach_data {
+ struct select_data select;
+ struct dentry *mountpoint;
+};
+static enum d_walk_ret detach_and_collect(void *_data, struct dentry *dentry)
{
- struct select_data *data = _data;
+ struct detach_data *data = _data;

if (d_mountpoint(dentry)) {
- data->found = -EBUSY;
+ __dget_dlock(dentry);
+ data->mountpoint = dentry;
return D_WALK_QUIT;
}

- return select_collect(_data, dentry);
+ return select_collect(&data->select, dentry);
}

static void check_and_drop(void *_data)
{
- struct select_data *data = _data;
+ struct detach_data *data = _data;

- if (d_mountpoint(data->start))
- data->found = -EBUSY;
- if (!data->found)
- __d_drop(data->start);
+ if (!data->mountpoint && !data->select.found)
+ __d_drop(data->select.start);
}

/**
- * check_submounts_and_drop - prune dcache, check for submounts and drop
+ * check_submounts_and_drop - detach submounts, prune dcache, and drop
*
- * All done as a single atomic operation relative to has_unlinked_ancestor().
- * Returns 0 if successfully unhashed @parent. If there were submounts then
- * return -EBUSY.
+ * The final d_drop is done as an atomic operation relative to
+ * rename_lock ensuring there are no races with d_set_mounted. This
+ * ensures there are no unhashed dentries on the path to a mountpoint.
*
- * @dentry: dentry to prune and drop
+ * @dentry: dentry to detach, prune and drop
*/
int check_submounts_and_drop(struct dentry *dentry)
{
@@ -1416,19 +1419,24 @@ int check_submounts_and_drop(struct dentry *dentry)
}

for (;;) {
- struct select_data data;
+ struct detach_data data;

- INIT_LIST_HEAD(&data.dispose);
- data.start = dentry;
- data.found = 0;
+ data.mountpoint = NULL;
+ INIT_LIST_HEAD(&data.select.dispose);
+ data.select.start = dentry;
+ data.select.found = 0;

- d_walk(dentry, &data, check_and_collect, check_and_drop);
- ret = data.found;
+ d_walk(dentry, &data, detach_and_collect, check_and_drop);

- if (!list_empty(&data.dispose))
- shrink_dentry_list(&data.dispose);
+ if (data.select.found)
+ shrink_dentry_list(&data.select.dispose);

- if (ret <= 0)
+ if (data.mountpoint) {
+ detach_mounts(data.mountpoint);
+ dput(data.mountpoint);
+ }
+
+ if (!data.mountpoint && !data.select.found)
break;

cond_resched();
@@ -2640,10 +2648,8 @@ static struct dentry *__d_unalias(struct inode *inode,
goto out_err;
m2 = &alias->d_parent->d_inode->i_mutex;
out_unalias:
- if (likely(!d_mountpoint(alias))) {
- __d_move(alias, dentry);
- ret = alias;
- }
+ __d_move(alias, dentry);
+ ret = alias;
out_err:
spin_unlock(&inode->i_lock);
if (m2)
diff --git a/fs/namei.c b/fs/namei.c
index 4e6fe16ef488..3fca30cd448b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3509,8 +3509,6 @@ int vfs_rmdir(struct inode *dir, struct dentry *dentry)
error = -EBUSY;
if (is_local_mountpoint(dentry))
goto out;
- if (d_mountpoint(dentry))
- goto out;

error = security_inode_rmdir(dir, dentry);
if (error)
@@ -3523,6 +3521,7 @@ int vfs_rmdir(struct inode *dir, struct dentry *dentry)

dentry->d_inode->i_flags |= S_DEAD;
dont_mount(dentry);
+ detach_mounts(dentry);

out:
mutex_unlock(&dentry->d_inode->i_mutex);
@@ -3624,7 +3623,7 @@ int vfs_unlink(struct inode *dir, struct dentry *dentry, struct inode **delegate
return -EPERM;

mutex_lock(&target->i_mutex);
- if (is_local_mountpoint(dentry) || d_mountpoint(dentry))
+ if (is_local_mountpoint(dentry))
error = -EBUSY;
else {
error = security_inode_unlink(dir, dentry);
@@ -3633,8 +3632,10 @@ int vfs_unlink(struct inode *dir, struct dentry *dentry, struct inode **delegate
if (error)
goto out;
error = dir->i_op->unlink(dir, dentry);
- if (!error)
+ if (!error) {
dont_mount(dentry);
+ detach_mounts(dentry);
+ }
}
}
out:
@@ -4005,8 +4006,6 @@ static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
error = -EBUSY;
if (is_local_mountpoint(old_dentry) || is_local_mountpoint(new_dentry))
goto out;
- if (d_mountpoint(old_dentry) || d_mountpoint(new_dentry))
- goto out;

error = -EMLINK;
if (max_links && !target && new_dir != old_dir &&
@@ -4022,6 +4021,7 @@ static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry,
if (target) {
target->i_flags |= S_DEAD;
dont_mount(new_dentry);
+ detach_mounts(new_dentry);
}
out:
if (target)
@@ -4051,8 +4051,6 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
error = -EBUSY;
if (is_local_mountpoint(old_dentry) || is_local_mountpoint(new_dentry))
goto out;
- if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
- goto out;

error = try_break_deleg(source, delegated_inode);
if (error)
@@ -4066,8 +4064,10 @@ static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry,
if (error)
goto out;

- if (target)
+ if (target) {
dont_mount(new_dentry);
+ detach_mounts(new_dentry);
+ }
if (!(old_dir->i_sb->s_type->fs_flags & FS_RENAME_DOES_D_MOVE))
d_move(old_dentry, new_dentry);
out:
--
1.7.5.4

Eric W. Biederman

unread,
Feb 15, 2014, 4:50:01 PM2/15/14
to

Now that d_invalidate can no longer fail, stop returning a useless
return code. For the few callers that checked the return code update
remove the handling of d_invalidate failure.

Signed-off-by: "Eric W. Biederman" <ebie...@xmission.com>
---
fs/btrfs/ioctl.c | 5 +----
fs/cifs/readdir.c | 6 +-----
fs/dcache.c | 12 +++---------
fs/fuse/dir.c | 4 +---
fs/namei.c | 10 +++++-----
fs/nfs/dir.c | 3 +--
include/linux/dcache.h | 2 +-
7 files changed, 13 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index b0134892dc70..349848bd54e2 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2229,9 +2229,7 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file,
}

mutex_lock(&inode->i_mutex);
- err = d_invalidate(dentry);
- if (err)
- goto out_unlock;
+ d_invalidate(dentry);

down_write(&root->fs_info->subvol_sem);

@@ -2316,7 +2314,6 @@ out_release:
btrfs_subvolume_release_metadata(root, &block_rsv, qgroup_reserved);
out_up_write:
up_write(&root->fs_info->subvol_sem);
-out_unlock:
mutex_unlock(&inode->i_mutex);
if (!err) {
shrink_dcache_sb(root->fs_info->sb);
diff --git a/fs/cifs/readdir.c b/fs/cifs/readdir.c
index b15862e0f68c..d0e9d0169b37 100644
--- a/fs/cifs/readdir.c
+++ b/fs/cifs/readdir.c
@@ -87,8 +87,6 @@ cifs_prime_dcache(struct dentry *parent, struct qstr *name,
return;

if (dentry) {
- int err;
-
inode = dentry->d_inode;
if (inode) {
/*
@@ -105,10 +103,8 @@ cifs_prime_dcache(struct dentry *parent, struct qstr *name,
goto out;
}
}
- err = d_invalidate(dentry);
+ d_invalidate(dentry);
dput(dentry);
- if (err)
- return;
}

/*
diff --git a/fs/dcache.c b/fs/dcache.c
index 5b41205cbf33..5b78bd98649c 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1377,27 +1377,21 @@ static void check_and_drop(void *_data)
* d_invalidate - detach submounts, prune dcache, and drop
* @dentry: dentry to invalidate (aka detach, prune and drop)
*
- * Try to invalidate the dentry if it turns out to be
- * possible. If there are reasons not to delete it
- * return -EBUSY. On success return 0.
- *
* no dcache lock.
*
* The final d_drop is done as an atomic operation relative to
* rename_lock ensuring there are no races with d_set_mounted. This
* ensures there are no unhashed dentries on the path to a mountpoint.
*/
-int d_invalidate(struct dentry *dentry)
+void d_invalidate(struct dentry *dentry)
{
- int ret = 0;
-
/*
* If it's already been dropped, return OK.
*/
spin_lock(&dentry->d_lock);
if (d_unhashed(dentry)) {
spin_unlock(&dentry->d_lock);
- return 0;
+ return;
}
spin_unlock(&dentry->d_lock);

@@ -1432,7 +1426,7 @@ int d_invalidate(struct dentry *dentry)
}

out:
- return ret;
+ return;
}
EXPORT_SYMBOL(d_invalidate);

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 5192d0a04d20..6e920fadb45d 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1225,9 +1225,7 @@ static int fuse_direntplus_link(struct file *file,
d_drop(dentry);
} else if (get_node_id(inode) != o->nodeid ||
((o->attr.mode ^ inode->i_mode) & S_IFMT)) {
- err = d_invalidate(dentry);
- if (err)
- goto out;
+ d_invalidate(dentry);
} else if (is_bad_inode(inode)) {
err = -EIO;
goto out;
diff --git a/fs/namei.c b/fs/namei.c
index 3fca30cd448b..384fcc6a5606 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1259,7 +1259,8 @@ static struct dentry *lookup_dcache(struct qstr *name, struct dentry *dir,
if (error < 0) {
dput(dentry);
return ERR_PTR(error);
- } else if (!d_invalidate(dentry)) {
+ } else {
+ d_invalidate(dentry);
dput(dentry);
dentry = NULL;
}
@@ -1391,10 +1392,9 @@ unlazy:
dput(dentry);
return status;
}
- if (!d_invalidate(dentry)) {
- dput(dentry);
- goto need_lookup;
- }
+ d_invalidate(dentry);
+ dput(dentry);
+ goto need_lookup;
}

path->mnt = mnt;
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index a8a2f7f86c45..a2b2b6d17c1f 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -464,8 +464,7 @@ void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry)
nfs_setsecurity(dentry->d_inode, entry->fattr, entry->label);
goto out;
} else {
- if (d_invalidate(dentry) != 0)
- goto out;
+ d_invalidate(dentry);
dput(dentry);
}
}
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index ae77222c3e86..95ddb0bdadb0 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -250,7 +250,7 @@ extern struct dentry * d_obtain_alias(struct inode *);
extern void shrink_dcache_sb(struct super_block *);
extern void shrink_dcache_parent(struct dentry *);
extern void shrink_dcache_for_umount(struct super_block *);
-extern int d_invalidate(struct dentry *);
+extern void d_invalidate(struct dentry *);

/* only used at mount-time */
extern struct dentry * d_make_root(struct inode *);
--
1.7.5.4

Eric W. Biederman

unread,
Feb 15, 2014, 4:50:03 PM2/15/14
to

Now that d_invalidate always succeeds it is not longer necessary or
desirable to hard code d_drop calls into filesystem specific
d_revalidate implementations.

Remove the unnecessary d_drop calls and rely on d_invalidate
to drop the dentries. Using d_invalidate ensures that paths
to mount points will not be dropped.

Signed-off-by: "Eric W. Biederman" <ebie...@xmission.com>
---
fs/ceph/dir.c | 1 -
fs/proc/base.c | 4 ----
fs/proc/fd.c | 2 --
3 files changed, 0 insertions(+), 7 deletions(-)

diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 6da4df84ba30..ecd2402739e2 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -1052,7 +1052,6 @@ static int ceph_d_revalidate(struct dentry *dentry, unsigned int flags)
ceph_dentry_lru_touch(dentry);
} else {
ceph_dir_clear_complete(dir);
- d_drop(dentry);
}
iput(dir);
return valid;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 51507065263b..aa62b37f1303 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1654,7 +1654,6 @@ int pid_revalidate(struct dentry *dentry, unsigned int flags)
put_task_struct(task);
return 1;
}
- d_drop(dentry);
return 0;
}

@@ -1791,9 +1790,6 @@ out:
put_task_struct(task);

out_notask:
- if (status <= 0)
- d_drop(dentry);
-
return status;
}

diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 985ea881b5bc..7e4f04ccab31 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -127,8 +127,6 @@ static int tid_fd_revalidate(struct dentry *dentry, unsigned int flags)
}
put_task_struct(task);
}
-
- d_drop(dentry);
return 0;
}

--
1.7.5.4

Eric W. Biederman

unread,
Feb 15, 2014, 4:50:03 PM2/15/14
to

Now that d_invalidate always succeeds and flushes mount points use
it in stead of a combination of shrink_dcache_parent and d_drop
in proc_flush_task_mnt. This removes the danger of a mount point
under /proc/<pid>/... becoming unreachable after the d_drop.

Signed-off-by: "Eric W. Biederman" <ebie...@xmission.com>
---
fs/proc/base.c | 6 ++----
1 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index aa62b37f1303..40d97e576d32 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2700,8 +2700,7 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
/* no ->d_hash() rejects on procfs */
dentry = d_hash_and_lookup(mnt->mnt_root, &name);
if (dentry) {
- shrink_dcache_parent(dentry);
- d_drop(dentry);
+ d_invalidate(dentry);
dput(dentry);
}

@@ -2721,8 +2720,7 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
name.len = snprintf(buf, sizeof(buf), "%d", pid);
dentry = d_hash_and_lookup(dir, &name);
if (dentry) {
- shrink_dcache_parent(dentry);
- d_drop(dentry);
+ d_invalidate(dentry);
dput(dentry);
}

--
1.7.5.4

Linus Torvalds

unread,
Feb 15, 2014, 6:00:02 PM2/15/14
to
On Sat, Feb 15, 2014 at 1:36 PM, Eric W. Biederman
<ebie...@xmission.com> wrote:
>
> The one difference between d_invalidate and check_submounts_and_drop
> is that d_invalidate must respect it when a d_revalidate method has
> earlier called d_drop so preserve the d_unhashed check in
> d_invalidate.

What?

There's another *huge* difference, namely locking. Your change has
huge detrimental effects wrt d_lock - not only do you drop it for the
local dentry (to then re-take it in check_submounts_and_drop()), but
the whole check_submounts_and_drop thing walks the parent chain and
locks each parent with the renamelock held for writing.

So NAK on this patch, if for no other reason that you are not talking
about the *huge* locking changes at all, and apparently completely
ignored them.

It's possible that you can argue that performance doesn't matter, and
that all the extra huge locking overhead is immaterial because this is
not commonly called, but no way in hell can I accept a patch with a
commit message that basically lies by omission about what it is doing.

This kills the whole series for me. I'm not going to bother trying to
review the rest of the patches when the second patch already makes me
go "Eric is trying to hide things". Because the locking changes aren't
obvious in the patch without knowing what else is going on, and they
certainly aren't mentioned in the commit message.

Linus

Linus Torvalds

unread,
Feb 15, 2014, 6:00:02 PM2/15/14
to
On Sat, Feb 15, 2014 at 2:51 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
>
> the whole check_submounts_and_drop thing walks the parent chain and
> locks each parent with the renamelock held for writing.

Oops, my bad about the write lock, brainfart due to grepping and
reading the wrong context...

check_submounts_and_drop() doesn't do the parent walk with the rename
lock held for writing, it just holds it for reading.

But it does do that very complex "walk parents and check all siblings"
and locks them, so the rest of the commentary was correct.

Eric W. Biederman

unread,
Feb 15, 2014, 6:30:02 PM2/15/14
to
Linus Torvalds <torv...@linux-foundation.org> writes:

> On Sat, Feb 15, 2014 at 2:51 PM, Linus Torvalds
> <torv...@linux-foundation.org> wrote:
>>
>> the whole check_submounts_and_drop thing walks the parent chain and
>> locks each parent with the renamelock held for writing.
>
> Oops, my bad about the write lock, brainfart due to grepping and
> reading the wrong context...
>
> check_submounts_and_drop() doesn't do the parent walk with the rename
> lock held for writing, it just holds it for reading.
>
> But it does do that very complex "walk parents and check all siblings"
> and locks them, so the rest of the commentary was correct.

Except that today d_invalidate drops the dcache lock and
calls shrink_dcache_parent. Which gets you into exactly the same
complex "walk parents and check all siblings" code.

The only difference between the shrink_dcache_parent and
check_submounts_and_drop (not counting the final drop)
is that check_submounts_and_drop aborts when it encounters a dentry
with d_mountpoint set.


So no I am not trying to hide something. I called out that I changed
this logic in particular and this particular patch all I am doing is
killing the enforcing of 2.2 era logic. Further I front loaded this
change so I bisect could point it's fingers at this before any other
substantial changes were made if this is indeed a problem.

Beyond that check_submounts_and_drop is what well maintained distributed
filesystems are calling from d_revalidate.


Now I would not be surprised if this change to d_invalidate is a
challenge to get your head around. It took me a while of reading the
code to realize (a) how the code makes some degree of sense today,
and (b) that the change is semantically safe.


But when shrink_dcache_parent and check_submounts_and_drop are
effectiely the same function I can't possibly see how you can argue how
the locking has changed or that I am trying to hide things.

Eric

Eric W. Biederman

unread,
Feb 15, 2014, 6:40:02 PM2/15/14
to
ebie...@xmission.com (Eric W. Biederman) writes:

> But when shrink_dcache_parent and check_submounts_and_drop are
> effectiely the same function I can't possibly see how you can argue how
> the locking has changed or that I am trying to hide things.

And in particular the only locking change that I can see at all is that
d_walk takes read_seqbegin_or_lock before checking the if the d_subdirs
list is empty, which is just an extra cache line read.

Which in practical terms appears like I have removed unnecessary special
cases in favor less code. Which I think if anything should perform
better just because there is less code to run, and what is happening is
less obfuscated.

Linus Torvalds

unread,
Feb 15, 2014, 7:10:02 PM2/15/14
to
On Sat, Feb 15, 2014 at 3:23 PM, Eric W. Biederman
<ebie...@xmission.com> wrote:
>
> Except that today d_invalidate drops the dcache lock and
> calls shrink_dcache_parent. Which gets you into exactly the same
> complex "walk parents and check all siblings" code.

Hmm. It only does that for directories that have sub-entries, though.

I think you may care just about directories (because that's what your
series is about), but d_invalidate() is used for other cases too,
notably d_revalidate() (ie things like stale NFS lookups of normal
files).

That said, I'll have to think about this more. If d_subdir is empty, I
guess d_walk() will be fairly cheap. It's very different, but maybe
not as disastrous as I thought.

Linus

Eric W. Biederman

unread,
Feb 15, 2014, 8:30:02 PM2/15/14
to
Linus Torvalds <torv...@linux-foundation.org> writes:

> On Sat, Feb 15, 2014 at 3:23 PM, Eric W. Biederman
> <ebie...@xmission.com> wrote:
>>
>> Except that today d_invalidate drops the dcache lock and
>> calls shrink_dcache_parent. Which gets you into exactly the same
>> complex "walk parents and check all siblings" code.
>
> Hmm. It only does that for directories that have sub-entries, though.
>
> I think you may care just about directories (because that's what your
> series is about), but d_invalidate() is used for other cases too,
> notably d_revalidate() (ie things like stale NFS lookups of normal
> files).
>
> That said, I'll have to think about this more. If d_subdir is empty, I
> guess d_walk() will be fairly cheap. It's very different, but maybe
> not as disastrous as I thought.

Thank you for taking the time to look.

It is also very true that I have been mostly focusing on the semantics
and correctness rather than on the performance of these changes.

I did keep a weather eye on the performance impact of these changes
though and as long as we can stand a slow down in the rare case where
mount points are present my changes should be fine.

Eric

Eric W. Biederman

unread,
Feb 15, 2014, 9:00:02 PM2/15/14
to
ebie...@xmission.com (Eric W. Biederman) writes:

> v2: Always drop the lock when exiting early.
> v3: Make detach_mounts robust about freeing several
> mounts on the same mountpoint at one time, and remove
> the unneeded mnt_list list test.
> v4: Document the purpose of detach_mounts and why new_mountpoint is
> safe to call.
>
> Signed-off-by: Eric W. Biederman <ebied...@twitter.com>
> ---
> fs/mount.h | 2 ++
> fs/namespace.c | 39 +++++++++++++++++++++++++++++++++++++++
> 2 files changed, 41 insertions(+), 0 deletions(-)
>
> diff --git a/fs/mount.h b/fs/mount.h
> index 50a72d46e7a6..2b470f34e665 100644
> --- a/fs/mount.h
> +++ b/fs/mount.h
> @@ -84,6 +84,8 @@ extern struct mount *__lookup_mnt_last(struct vfsmount *, struct dentry *);
>
> extern bool legitimize_mnt(struct vfsmount *, unsigned);
>
> +extern void detach_mounts(struct dentry *dentry);

After Linus's rant about performance I took a second look, and lo and
behold there is a significant optimization oppportunity missed here.

This should almost certainly be:

extern void __detach_mounts(struct dentry *dentry);

static inline void detach_mounts(struct dentry *dentry)
{
if (unlikely(d_mountpoint(dentry)))
return;
__detach_mounts(dentry);
}

Just so I don't take the namespace lock when not necessary.

Although wether this should be inline or at the start of detach_mounts
and out of line I don't know. Sigh. I will have to play with this
detail a bit more.

Eric

Miklos Szeredi

unread,
Feb 18, 2014, 12:20:01 PM2/18/14
to
Minor nit: return value of any is_* function is either true or false, so why not
declare it bool?

> +{
> + struct mnt_namespace *ns = current->nsproxy->mnt_ns;
> + struct mount *mnt;
> + int is_covered = 0;

And the same for this var.

Miklos Szeredi

unread,
Feb 18, 2014, 12:20:02 PM2/18/14
to
Howabout a get_mountpoint(dentry) helper, that returns NULL if it turns out to
be not a mountpoint? And, as an added bonus, you can drop the comment above as
well.

Miklos Szeredi

unread,
Feb 18, 2014, 12:40:01 PM2/18/14
to
You can optimize this by including the negative check within the above d_locked
region and calling __d_drop() instead.

Miklos Szeredi

unread,
Feb 18, 2014, 12:50:02 PM2/18/14
to
On Sat, Feb 15, 2014 at 01:34:43PM -0800, Eric W. Biederman wrote:
>
> The last round of this patchset the semantics of dropping mounts on
> unlink were agreed upon so long as we preserve the current semantics in
> a single mount namespace.
>
> There are two big changes with this version of the patchset.
> 1) The test for a mount in the current mount namespace has been
> rewritten so that it is actually correct and is now a drop in
> replacement for d_mountpoint. Removing the need for code changes
> that could cause unitended regressions.
>
> 2) This work has been comprehensivly extend to d_revalidate and
> d_invalidate allowing some long standing bugs to be fixed at the end
> of this patch series.
>
> In net this should result in a kernel that is more comprehensible and
> maintainble as well as fixing some possible mount leaks, and removing
> the possibility of DOS attacks from evily placed mount points, in other
> mount namespaces.

This patchset looks good to me. You can add

Reviewed-by: Miklos Szeredi <mik...@szeredi.hu>

Thanks,
Miklos

Eric W. Biederman

unread,
Feb 18, 2014, 4:30:02 PM2/18/14
to

>> diff --git a/fs/namespace.c b/fs/namespace.c
>> index 22e536705c45..6dc23614d44d 100644
>> --- a/fs/namespace.c
>> +++ b/fs/namespace.c
>> @@ -631,6 +631,41 @@ struct vfsmount *lookup_mnt(struct path *path)
>> return m;
>> }
>>
>> +/*
>> + * __is_local_mountpoint - Test to see if dentry is a mountpoint in the
>> + * current mount namespace.
>> + *
>> + * The common case is dentries are not mountpoints at all and that
>> + * test is handled inline. For the slow case when we are actually
>> + * dealing with a mountpoint of some kind, walk through all of the
>> + * mounts in the current mount namespace and test to see if the dentry
>> + * is a mountpoint.
>> + *
>> + * The mount_hashtable is not usable in the context because we
>> + * need to identify all mounts that may be in the current mount
>> + * namespace not just a mount that happens to have some specified
>> + * parent mount.
>> + */
>> +int __is_local_mountpoint(struct dentry *dentry)
>
> Minor nit: return value of any is_* function is either true or false, so why not
> declare it bool?

Because I am working on the core of the kernel and C compilers do weird
things with bool variables (storing them in bytes...). I expected a
type that the C compiler does not do weird things with would be more
readily received on a path whose performance people are interested in.

>> +{
>> + struct mnt_namespace *ns = current->nsproxy->mnt_ns;
>> + struct mount *mnt;
>> + int is_covered = 0;
>
> And the same for this var.
>
>> +
>> + if (!d_mountpoint(dentry))
>> + goto out;
>> +
>> + down_read(&namespace_sem);
>> + list_for_each_entry(mnt, &ns->list, mnt_list) {
>> + is_covered = (mnt->mnt_mountpoint == dentry);
>> + if (is_covered)
>> + break;
>> + }
>> + up_read(&namespace_sem);
>> +out:
>> + return is_covered;
>> +}
>> +
>> static struct mountpoint *new_mountpoint(struct dentry *dentry)
>> {
>> struct list_head *chain = mountpoint_hashtable + hash(NULL, dentry);
>> --
>> 1.7.5.4

Eric
It is loading more messages.
0 new messages