[RFC PATCH 2/2] capabilities: allow nice if we are privileged

Serge Hallyn

unread,

Jul 23, 2013, 2:20:02 PM7/23/13

to

We allow task A to change B's nice level if it has a supserset of
B's privileges, or of it has CAP_SYS_NICE. Also allow it if A has
CAP_SYS_NICE with respect to B - meaning it is root in the same
namespace, or it created B's namespace.

Signed-off-by: Serge Hallyn <serge....@canonical.com>
---
security/commoncap.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/security/commoncap.c b/security/commoncap.c
index d78b003..ef98b56 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -768,16 +768,16 @@ int cap_task_fix_setuid(struct cred *new, const struct cred *old, int flags)
*/
static int cap_safe_nice(struct task_struct *p)
{
- int is_subset;
+ int is_subset, ret = 0;

rcu_read_lock();
is_subset = cap_issubset(__task_cred(p)->cap_permitted,
current_cred()->cap_permitted);
+ if (!is_subset && !ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE))
+ ret = -EPERM;
rcu_read_unlock();

- if (!is_subset && !capable(CAP_SYS_NICE))
- return -EPERM;
- return 0;
+ return ret;
}

/**
--
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Serge Hallyn

unread,

Jul 23, 2013, 2:20:02 PM7/23/13

to

We allow a task to change its own devices cgroup, or to change other tasks'
cgroups if it has CAP_SYS_ADMIN.

Also allow task A to change task B's cgroup if task A has CAP_SYS_ADMIN
with respect to task B - meaning A is root in the same userns, or A
created B's userns.

Signed-off-by: Serge Hallyn <serge....@canonical.com>
---

security/device_cgroup.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 9760ecb6..8f5386ea 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -74,11 +74,14 @@ struct cgroup_subsys devices_subsys;
static int devcgroup_can_attach(struct cgroup *new_cgrp,
struct cgroup_taskset *set)
{
- struct task_struct *task = cgroup_taskset_first(set);
+ struct task_struct *t = cgroup_taskset_first(set);
+ int ret = 0;

- if (current != task && !capable(CAP_SYS_ADMIN))

- return -EPERM;
- return 0;

+ rcu_read_lock();
+ if (current != t && !ns_capable(__task_cred(t)->user_ns, CAP_SYS_ADMIN))
+ ret = -EPERM;
+ rcu_read_unlock();
+ return ret;
}

/*

Tejun Heo

unread,

Jul 23, 2013, 2:40:02 PM7/23/13

to

On Tue, Jul 23, 2013 at 01:16:06PM -0500, Serge Hallyn wrote:
> We allow a task to change its own devices cgroup, or to change other tasks'
> cgroups if it has CAP_SYS_ADMIN.
>
> Also allow task A to change task B's cgroup if task A has CAP_SYS_ADMIN
> with respect to task B - meaning A is root in the same userns, or A
> created B's userns.

As discussed multpile times, cgroup isn't gonna support delegating
cgroup management directly into containers, so this doesn't really
jive with where we're heading.

Thanks.

--
tejun

Serge Hallyn

unread,

Jul 23, 2013, 2:40:02 PM7/23/13

to

Quoting Tejun Heo (t...@kernel.org):
> On Tue, Jul 23, 2013 at 01:16:06PM -0500, Serge Hallyn wrote:
> > We allow a task to change its own devices cgroup, or to change other tasks'
> > cgroups if it has CAP_SYS_ADMIN.
> >
> > Also allow task A to change task B's cgroup if task A has CAP_SYS_ADMIN
> > with respect to task B - meaning A is root in the same userns, or A
> > created B's userns.
>
> As discussed multpile times, cgroup isn't gonna support delegating
> cgroup management directly into containers, so this doesn't really
> jive with where we're heading.

This doesn't delegate it into the container. It allows me, on the host,
to set the cgroup for a container.

thanks,
-serge

Tejun Heo

unread,

Jul 23, 2013, 3:00:01 PM7/23/13

to

On Tue, Jul 23, 2013 at 2:38 PM, Serge Hallyn <serge....@ubuntu.com> wrote:
> This doesn't delegate it into the container. It allows me, on the host,
> to set the cgroup for a container.

Hmmm? I'm a bit confused. Isn't the description saying that the patch
allows pseudo-root in userns to change cgroup membership even if it
isn't actually root?

Besides, I find the whole check rather bogus and would actually much
prefer just nuking the check and just follow the standard permission
checks.

Thanks.

--
tejun

Serge Hallyn

unread,

Jul 23, 2013, 3:10:02 PM7/23/13

to

Quoting Tejun Heo (t...@kernel.org):
> On Tue, Jul 23, 2013 at 2:38 PM, Serge Hallyn <serge....@ubuntu.com> wrote:
> > This doesn't delegate it into the container. It allows me, on the host,
> > to set the cgroup for a container.
>
> Hmmm? I'm a bit confused. Isn't the description saying that the patch
> allows pseudo-root in userns to change cgroup membership even if it
> isn't actually root?

If task A is uid 1000 on the host, and creates task B as uid X in a new
user namespace, then task A, still being uid 1000 on the host, is
privileged with respect to B and his namespace - i.e.
ns_capable(B->userns, CAP_SYS_ADMIN) is true.

> Besides, I find the whole check rather bogus and would actually much
> prefer just nuking the check and just follow the standard permission
> checks.

I'd be ok with that - but there's one case I'm not sure about: If PAM
sets me up with /sys/fs/cgroup/devices/serge owned by me, then if I'm
thinking right, removing can_attach would mean I could move init into
/sys/fs/cgroup/devices/serge...

Is there something else stopping that from happening?

-serge

Tejun Heo

unread,

Jul 23, 2013, 3:20:02 PM7/23/13

to

Hello,

On Tue, Jul 23, 2013 at 02:04:26PM -0500, Serge Hallyn wrote:
> If task A is uid 1000 on the host, and creates task B as uid X in a new
> user namespace, then task A, still being uid 1000 on the host, is
> privileged with respect to B and his namespace - i.e.
> ns_capable(B->userns, CAP_SYS_ADMIN) is true.

Well, that also is the exact type of priv delegation we're moving away
from, so....

> > Besides, I find the whole check rather bogus and would actually much
> > prefer just nuking the check and just follow the standard permission
> > checks.
>
> I'd be ok with that - but there's one case I'm not sure about: If PAM
> sets me up with /sys/fs/cgroup/devices/serge owned by me, then if I'm
> thinking right, removing can_attach would mean I could move init into
> /sys/fs/cgroup/devices/serge...
>
> Is there something else stopping that from happening?

If PAM is giving out perms on cgroup directory, the whole system is
prone to DoS in various ways anyway. It's already utterly broken, so
kinda moot point. If there are people actually doing that in the
wild, we can conditionalize it on cgroup_sane_behavior().

Thanks.

--
tejun

Serge Hallyn

unread,

Jul 23, 2013, 3:30:02 PM7/23/13

to

Quoting Tejun Heo (t...@kernel.org):
> Hello,
>
> On Tue, Jul 23, 2013 at 02:04:26PM -0500, Serge Hallyn wrote:
> > If task A is uid 1000 on the host, and creates task B as uid X in a new
> > user namespace, then task A, still being uid 1000 on the host, is
> > privileged with respect to B and his namespace - i.e.
> > ns_capable(B->userns, CAP_SYS_ADMIN) is true.
>
> Well, that also is the exact type of priv delegation we're moving away
> from, so....

I think that's unreasonable, but I guess I'll have to go reread the
old thread.

> > > Besides, I find the whole check rather bogus and would actually much
> > > prefer just nuking the check and just follow the standard permission
> > > checks.
> >
> > I'd be ok with that - but there's one case I'm not sure about: If PAM
> > sets me up with /sys/fs/cgroup/devices/serge owned by me, then if I'm
> > thinking right, removing can_attach would mean I could move init into
> > /sys/fs/cgroup/devices/serge...
> >
> > Is there something else stopping that from happening?
>
> If PAM is giving out perms on cgroup directory, the whole system is
> prone to DoS in various ways anyway. It's already utterly broken, so

If we have decent enforcement of hierarchy for devices.{allow,deny},
which we now do, then I don't see why this has to be the case.

> kinda moot point. If there are people actually doing that in the
> wild, we can conditionalize it on cgroup_sane_behavior().

Guess we'll stop using cgroups for now.

-serge

Tejun Heo

unread,

Jul 23, 2013, 3:40:01 PM7/23/13

to

Hello, Serge.

On Tue, Jul 23, 2013 at 3:28 PM, Serge Hallyn <serge....@ubuntu.com> wrote:
>> Well, that also is the exact type of priv delegation we're moving away
>> from, so....
>
> I think that's unreasonable, but I guess I'll have to go reread the
> old thread.

Yeah, please do. I think the case is pretty strong for disallowing
delegation of cgroup directories to !root (or whatever CAP it should
be) users. It's inherently unsafe for some controllers and ends up
leaking kernel implementation details into regular binaries thus
cementing those leaked details as APIs, which is a giant no-no.

> If we have decent enforcement of hierarchy for devices.{allow,deny},
> which we now do, then I don't see why this has to be the case.

If you think about devcg in isolation, maybe, but please keep in mind
that devcg itself is already somewhat abusing cgroup and many other
controllers shouldn't be allowed to delegate and we're headed for one
unified hierarchy.

>> kinda moot point. If there are people actually doing that in the
>> wild, we can conditionalize it on cgroup_sane_behavior().
>
> Guess we'll stop using cgroups for now.

If you're delegating cgroup accesses to !root users, yes, STOP,
please. It'd be doing a lot more harm to the whole kernel and its
maintainability than whatever extra features / benefits it may be
bringing.

Thanks.

--
tejun

Eric W. Biederman

unread,

Jul 24, 2013, 4:10:02 AM7/24/13

to

Serge Hallyn <serge....@ubuntu.com> writes:

> We allow task A to change B's nice level if it has a supserset of
> B's privileges, or of it has CAP_SYS_NICE. Also allow it if A has
> CAP_SYS_NICE with respect to B - meaning it is root in the same
> namespace, or it created B's namespace.

This patch looks good to me.

We already have this logic elsewhere in the kernel so I don't expect
this will make things worse.

Reviewed-by: "Eric W. Biederman" <ebie...@xmission.com>

Serge E. Hallyn

unread,

Oct 22, 2013, 8:50:02 PM10/22/13

to

Quoting Tejun Heo (t...@kernel.org):
> On Tue, Jul 23, 2013 at 2:38 PM, Serge Hallyn <serge....@ubuntu.com> wrote:
> > This doesn't delegate it into the container. It allows me, on the host,
> > to set the cgroup for a container.
>
> Hmmm? I'm a bit confused. Isn't the description saying that the patch
> allows pseudo-root in userns to change cgroup membership even if it
> isn't actually root?
>
> Besides, I find the whole check rather bogus and would actually much
> prefer just nuking the check and just follow the standard permission
> checks.

Can we please nuke it like this then?

From b840083ec8fa1f0645ae925c79db3dc51edd019c Mon Sep 17 00:00:00 2001
From: Serge Hallyn <serge....@ubuntu.com>
Date: Wed, 23 Oct 2013 01:34:00 +0200
Subject: [PATCH 1/1] device_cgroup: remove can_attach

It is really only wanting to duplicate a check which is already done by the
cgroup subsystem.

With this patch, user jdoe still cannot move pid 1 into a devices cgroup
he owns, but now he can move his own other tasks into devices cgroups.

Signed-off-by: Serge Hallyn <serge....@ubuntu.com>
Cc: Aristeu Rozanski <ar...@redhat.com>
Cc: Tejun Heo <t...@kernel.org>
---
security/device_cgroup.c | 11 -----------
1 file changed, 11 deletions(-)

diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index efc6f68..a37c054 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -64,16 +64,6 @@ static inline struct dev_cgroup *task_devcgroup(struct task_struct *task)

struct cgroup_subsys devices_subsys;

-static int devcgroup_can_attach(struct cgroup_subsys_state *new_css,
- struct cgroup_taskset *set)
-{

- struct task_struct *task = cgroup_taskset_first(set);

-

- if (current != task && !capable(CAP_SYS_ADMIN))
- return -EPERM;
- return 0;

-}
-
/*
* called under devcgroup_mutex
*/
@@ -698,7 +688,6 @@ static struct cftype dev_cgroup_files[] = {

struct cgroup_subsys devices_subsys = {
.name = "devices",
- .can_attach = devcgroup_can_attach,
.css_alloc = devcgroup_css_alloc,
.css_free = devcgroup_css_free,
.css_online = devcgroup_online,
--
1.8.3.2

Tejun Heo

unread,

Oct 24, 2013, 7:00:02 AM10/24/13

to

On Wed, Oct 23, 2013 at 12:41:30AM +0000, Serge E. Hallyn wrote:
> Quoting Tejun Heo (t...@kernel.org):
> > On Tue, Jul 23, 2013 at 2:38 PM, Serge Hallyn <serge....@ubuntu.com> wrote:
> > > This doesn't delegate it into the container. It allows me, on the host,
> > > to set the cgroup for a container.
> >
> > Hmmm? I'm a bit confused. Isn't the description saying that the patch
> > allows pseudo-root in userns to change cgroup membership even if it
> > isn't actually root?
> >
> > Besides, I find the whole check rather bogus and would actually much
> > prefer just nuking the check and just follow the standard permission
> > checks.
>
> Can we please nuke it like this then?
>
> From b840083ec8fa1f0645ae925c79db3dc51edd019c Mon Sep 17 00:00:00 2001
> From: Serge Hallyn <serge....@ubuntu.com>
> Date: Wed, 23 Oct 2013 01:34:00 +0200
> Subject: [PATCH 1/1] device_cgroup: remove can_attach
>
> It is really only wanting to duplicate a check which is already done by the
> cgroup subsystem.
>
> With this patch, user jdoe still cannot move pid 1 into a devices cgroup
> he owns, but now he can move his own other tasks into devices cgroups.
>
> Signed-off-by: Serge Hallyn <serge....@ubuntu.com>
> Cc: Aristeu Rozanski <ar...@redhat.com>
> Cc: Tejun Heo <t...@kernel.org>

Applied to cgroup/for-3.13. Thanks.

--
tejun

Serge E. Hallyn

unread,

Nov 4, 2013, 5:00:03 PM11/4/13

to

Quoting Tejun Heo (t...@kernel.org):
> Hello, Serge.
>
> On Tue, Jul 23, 2013 at 3:28 PM, Serge Hallyn <serge....@ubuntu.com> wrote:

> > On Tue, Jul 23, 2013 at 02:04:26PM -0500, Serge Hallyn wrote:
> > > If task A is uid 1000 on the host, and creates task B as uid X in a new
> > > user namespace, then task A, still being uid 1000 on the host, is
> > > privileged with respect to B and his namespace - i.e.
> > > ns_capable(B->userns, CAP_SYS_ADMIN) is true.
>

> >> Well, that also is the exact type of priv delegation we're moving away
> >> from, so....
> >
> > I think that's unreasonable, but I guess I'll have to go reread the
> > old thread.
>
> Yeah, please do. I think the case is pretty strong for disallowing
> delegation of cgroup directories to !root (or whatever CAP it should
> be) users. It's inherently unsafe for some controllers and ends up
> leaking kernel implementation details into regular binaries thus
> cementing those leaked details as APIs, which is a giant no-no.

Hi Tejun,

So to come back to this... At plumbers, I believe you were not present
at the 'let me contain that for you' session. There the "delegation is
not safe" topic came up. When pressed, the only example that came up
was (IIRC) having to do with an rt scheduling issue which, all agreed,
was a bug in the implementation which should be fixed, rather than a valid
reason to say that delegation of cgroups should be fundamentally not supported.

Do you have a list of such issues which you see with delegation? That is,
cases where, if ownership of a subtree is granted to a non-root user,
that user can affect tasks owned by other users who are in other
cgroups?

-serge

Tejun Heo

unread,

Nov 4, 2013, 5:10:02 PM11/4/13

to

Hello,

On Mon, Nov 04, 2013 at 09:51:35PM +0000, Serge E. Hallyn wrote:
> Do you have a list of such issues which you see with delegation? That is,
> cases where, if ownership of a subtree is granted to a non-root user,
> that user can affect tasks owned by other users who are in other
> cgroups?

A lot of security is about logistics and cgroup simply doesn't have
them - depth, number of cgroups quota, even config changes or
subdirectory operations which involve RCU operations can easily be
used for DoS attacks. Just think about how much complexity and effort
need to be spent on making and maintaining anything properly
delegatable to !priv users. cgroup has never spent such design or
implementation effort - e.g. take a look at how event_control thing is
implemented, it's extremely easy to trigger OOM if you give that out
to !priv users.

cgroup has *never* been safe to give out to !priv users and it is
highly unlikely to be in any foreseeable future. It will be a big new
giant feature which I frankly don't think is worth the risk or effort.
Think of it as giving out sysctl or firewall rule control to !priv
users. Giving out subset of those controls do make sense in terms of
function but we don't do that and don't have infrastructure to support
such usage. cgroup at this stage isn't that different. If you insist
on doing that, you can but it is severely compromising in terms of
security and it'll stay that way for the foreseeable future.

Thanks.

--
tejun