[RFC,PATCH 0/2] dynamic seccomp policies (using BPF filters)

Will Drewry

unread,

Jan 11, 2012, 12:30:03 PM1/11/12

to

The goal of the patchset is straightforward:

To provide a means of reducing the kernel attack surface.

In practice, this is done at the primary kernel ABI: system calls.
Achieving this goal will address the needs expressed by many systems
projects:
qemu/kvm, openssh, vsftpd, lxc, and chromium and chromium os (me).

While system call filtering has been attempted many times, I hope that
this approach shows more promise. It works as described below and in
the patch series.

A userland task may call prctl(PR_ATTACH_SECCOMP_FILTER) to attach a
BPF program to itself. Once attached, all system calls made by the
task will be evaluated by the BPF program prior to being accepted.
Evaluation is done by executing the BPF program over the struct
user_regs_state for the process.

!! If you don't care about background or reasoning, stop reading !!

Past attempts have used:
- bitmap of system call numbers evaluated by seccomp (or tracehooks)
- standalone data structures and extra entry hooks
(cgroups syscall, systrace)
- a collection of ftrace filter strings evaluated by seccomp
- perf_event hackery to allow process termination when an event matches
(or doesn't)

In addition to the publicly posted approaches, I've personally attempted
continued deeper integration with ftrace along a number of different
lines (lead up to that can be found here[1]). What inspired the current
patch series was a number of realizations:
1. Userland knows its ABI - that's how it made the system calls in the
first place.
2. We already exposed a filtering system to userland processes in the
form of BPF and there is continued focus on optimizing evaluation
even after so many years.
3. System call filtering policies should not expose
time-of-check-time-of-use (TOCTOU) vulnerable interfaces but should
expose all the information that may be relevant to a syscall policy
decision.

The prior seccomp-ftrace implementations struggled with very
fixable challenges in ftrace: incomplete syscall coverage,
mismatched syscall names versus unistd, incomplete arch coverage,
etc. These challenges may all be fixed with some time and effort, and
potentially, even closer integration. I explored a number of
alternative approaches from making system call tracepoints per-thread
and "active" to adding a new less-perf-oriented system call.

In the process of experimentation, a number of things became clear:
- perf/ftrace system-wide analysis goals don't align with lightweight
per-thread analysis.
- ftrace/perf ABI doesn't mix well with security policy enforcement,
reduced attack surface environments, or keeping users from specifing
vulnerable filtering policies.
- other than system calls, tracepoints aren't considered ABI-stable.

The core focus of ftrace and perf is to support system-wide
performance and debugging tracing. Despite its amazing flexibility,
there are tradeoffs that are made to provide efficient system-wide
behavior that are less efficient at a per-thread level. For instance,
system call tracepoints are global. It is possible to make them
per-thread (since they use a TIF anyway). However, doing so would mean
that a system-wide system call analysis would require one trace event
per thread rather than one total. It's possible to alleviate that pain,
but that in turn requires more bookkeeping (global versus local
tracepoint registrations mapping to the thread info flag).

Another example is the ftrace ABI. Both the debugfs entry point with
unstable event ids and the perf-oriented perf_event_open(2) are not
suitable to providing a subsystem which is meant to reduce the attack
surface -- much less avoid maintainer flame wars :) The third aspect of
its ABI was also concerning and hints at yet-another-potential struggle.
The ftrace filter language happily accepts globbing and string matching.
This is excellent for tracing, but horrible for system call
interposition. If, despite warning, a user decides that blocking a
system call based on a string is what they want, they can do it. The
result is that their policy may be bypassed due to a time of check, time
of use race. While addressable, it would mean that the filtering engine
would need to allow operation filtering or offer a "secure" subset.

A side challenge that emerged from the desire to enable tracing to act
as a security policy mechanism was the ability to enact policy over more
than just the system calls. While this would be doable if all
tracepoints became active, there is a fundamental problem in that very
little, if any, tracepoints aside from system calls can be considered
stable. If a subset were to emerge as stable, there is still the
challenge of enacting security policy in parallel with tracing policy.
In an example patch where security policy logic was added to
perf_event_open(2), the basics of the system worked, but enforcement of
the security policy was simplistic and intertwined with a large number
of event attributes that were meaningless or altered the behavior.

At every turn, it appears that the tracing infrastructure was unsuited
for being used for attack surface reduction or as a larger security
subsystem on its own. It is well suited for feeding a policy
enforcement mechanism (like seccomp), but not for letting the logic
co-exist. It doesn't mean that it has security problems, just that
there will be a continued struggle between having a really good perf
system and and really good kernel attack surface reduction system if
they were merged. While there may be some distant vision where the
apparent struggle does not exist, I don't see how it would be reached.
Of course, anything is possible with unlimited time. :)

That said, much of that discussion is history and to fill in some of the
gaps since I posted the last ftrace-based patches. This patch series
should stand on its own as both straightforward and effective. In my
opinion, this is the direction I should have taken before I sent my
first patch.

I am looking forward to any and all feedback - thanks!
will

[1] http://search.gmane.org/?query=seccomp+wad%40chromium.org&group=gmane.linux.kernel

Will Drewry (3):
seccomp_filters: dynamic system call filtering using BPF programs
Documentation: prctl/seccomp_filter

Documentation/prctl/seccomp_filter.txt | 179 ++++++++
fs/exec.c | 5 +
include/linux/prctl.h | 3 +
include/linux/seccomp.h | 70 +++++-
kernel/Makefile | 1 +
kernel/fork.c | 4 +
kernel/seccomp.c | 8 +
kernel/seccomp_filter.c | 639 +++++++++++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 4 +
security/Kconfig | 12 +
9 files changed, 743 insertions(+), 3 deletions(-)
create mode 100644 kernel/seccomp_filter.c
create mode 100644 Documentation/prctl/seccomp_filter.txt
--
1.7.5.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Will Drewry

unread,

Jan 11, 2012, 12:30:03 PM1/11/12

to

This patch adds support for seccomp mode 2. This mode enables dynamic
enforcement of system call filtering policy in the kernel as specified
by a userland task. The policy is expressed in terms of a BPF program,
as is used for userland-exposed socket filtering. Instead of network
data, the BPF program is evaluated over struct user_regs_struct at the
time of the system call (as retrieved using regviews).

A filter program may be installed by a userland task by calling
prctl(PR_ATTACH_SECCOMP_FILTER, &fprog);
where fprog is of type struct sock_fprog.

If the first filter program allows subsequent prctl(2) calls, then
additional filter programs may be attached. All attached programs
must be evaluated before a system call will be allowed to proceed.

To avoid CONFIG_COMPAT related landmines, once a filter program is
installed using specific is_compat_task() and current->personality, it
is not allowed to make system calls or attach additional filters which
use a different combination of is_compat_task() and
current->personality.

Filter programs may _only_ cross the execve(2) barrier if last filter
program was attached by a task with CAP_SYS_ADMIN capabilities in its
user namespace. Once a task-local filter program is attached from a
process without privileges, execve will fail. This ensures that only
privileged parent task can affect its privileged children (e.g., setuid
binary).

There are a number of benefits to this approach. A few of which are
as follows:
- BPF has been exposed to userland for a long time.
- Userland already knows its ABI: expected register layout and system
call numbers.
- Full register information is provided which may be relevant for
certain syscalls (fork, rt_sigreturn) or for other userland
filtering tactics (checking the PC).
- No time-of-check-time-of-use vulnerable data accesses are possible.

This patch includes its own BPF evaluator, but relies on the
net/core/filter.c BPF checking code. It is possible to share
evaluators, but the performance sensitive nature of the network
filtering path makes it an iterative optimization which (I think :) can
be tackled separately via separate patchsets. (And at some point sharing
BPF JIT code!)

Signed-off-by: Will Drewry <w...@chromium.org>
---

diff --git a/fs/exec.c b/fs/exec.c
index 3625464..e9cc89c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -44,6 +44,7 @@
#include <linux/namei.h>
#include <linux/mount.h>
#include <linux/security.h>
+#include <linux/seccomp.h>
#include <linux/syscalls.h>
#include <linux/tsacct_kern.h>
#include <linux/cn_proc.h>
@@ -1477,6 +1478,10 @@ static int do_execve_common(const char *filename,
if (retval)
goto out_ret;

+ retval = seccomp_check_exec();
+ if (retval)
+ goto out_ret;
+
retval = -ENOMEM;
bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
if (!bprm)
diff --git a/include/linux/prctl.h b/include/linux/prctl.h
index a3baeb2..15e2460 100644
--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -64,6 +64,9 @@
#define PR_GET_SECCOMP 21
#define PR_SET_SECCOMP 22

+/* Set process seccomp filters */
+#define PR_ATTACH_SECCOMP_FILTER 36
+
/* Get/set the capability bounding set (as per security/commoncap.c) */
#define PR_CAPBSET_READ 23
#define PR_CAPBSET_DROP 24
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index cc7a4e9..99d163e 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -5,9 +5,28 @@
#ifdef CONFIG_SECCOMP

#include <linux/thread_info.h>
+#include <linux/types.h>
#include <asm/seccomp.h>

-typedef struct { int mode; } seccomp_t;
+struct seccomp_filter;
+/**
+ * struct seccomp_struct - the state of a seccomp'ed process
+ *
+ * @mode:
+ * if this is 0, seccomp is not in use.
+ * is 1, the process is under standard seccomp rules.
+ * is 2, the process is only allowed to make system calls where
+ * associated filters evaluate successfully.
+ * @filter: Metadata for filter if using CONFIG_SECCOMP_FILTER.
+ * @filter must only be accessed from the context of current as there
+ * is no guard.
+ */
+typedef struct seccomp_struct {
+ int mode;
+#ifdef CONFIG_SECCOMP_FILTER
+ struct seccomp_filter *filter;
+#endif
+} seccomp_t;

extern void __secure_computing(int);
static inline void secure_computing(int this_syscall)
@@ -28,8 +47,7 @@ static inline int seccomp_mode(seccomp_t *s)

#include <linux/errno.h>

-typedef struct { } seccomp_t;
-
+typedef struct seccomp_struct { } seccomp_t;
#define secure_computing(x) do { } while (0)

static inline long prctl_get_seccomp(void)
@@ -49,4 +67,50 @@ static inline int seccomp_mode(seccomp_t *s)

#endif /* CONFIG_SECCOMP */

+#ifdef CONFIG_SECCOMP_FILTER
+
+#define seccomp_filter_init_task(_tsk) do { \
+ (_tsk)->seccomp.filter = NULL; \
+} while (0);
+
+/* No locking is needed here because the task_struct will
+ * have no parallel consumers.
+ */
+#define seccomp_filter_free_task(_tsk) do { \
+ put_seccomp_filter((_tsk)->seccomp.filter); \
+} while (0);
+
+extern int seccomp_check_exec(void);
+
+extern long prctl_attach_seccomp_filter(char __user *);
+
+extern struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *);
+extern void put_seccomp_filter(struct seccomp_filter *);
+
+extern int seccomp_test_filters(int);
+extern void seccomp_filter_log_failure(int);
+extern void seccomp_filter_fork(struct task_struct *child,
+ struct task_struct *parent);
+
+#else /* CONFIG_SECCOMP_FILTER */
+
+#include <linux/errno.h>
+
+struct seccomp_filter { };
+#define seccomp_filter_init_task(_tsk) do { } while (0);
+#define seccomp_filter_fork(_tsk, _orig) do { } while (0);
+#define seccomp_filter_free_task(_tsk) do { } while (0);
+
+static inline int seccomp_check_exec(void)
+{
+ return 0;
+}
+
+
+static inline long prctl_attach_seccomp_filter(char __user *a2)
+{
+ return -ENOSYS;
+}
+
+#endif /* CONFIG_SECCOMP_FILTER */
#endif /* _LINUX_SECCOMP_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index e898c5b..0584090 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -79,6 +79,7 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
+obj-$(CONFIG_SECCOMP_FILTER) += seccomp_filter.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
obj-$(CONFIG_TREE_RCU) += rcutree.o
obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
diff --git a/kernel/fork.c b/kernel/fork.c
index da4a6a1..cc1d628 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -34,6 +34,7 @@
#include <linux/cgroup.h>
#include <linux/security.h>
#include <linux/hugetlb.h>
+#include <linux/seccomp.h>
#include <linux/swap.h>
#include <linux/syscalls.h>
#include <linux/jiffies.h>
@@ -166,6 +167,7 @@ void free_task(struct task_struct *tsk)
free_thread_info(tsk->stack);
rt_mutex_debug_task_free(tsk);
ftrace_graph_exit_task(tsk);
+ seccomp_filter_free_task(tsk);
free_task_struct(tsk);
}
EXPORT_SYMBOL(free_task);
@@ -1209,6 +1211,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
/* Perform scheduler related setup. Assign this task to a CPU. */
sched_fork(p);

+ seccomp_filter_init_task(p);
retval = perf_event_init_task(p);
if (retval)
goto bad_fork_cleanup_policy;
@@ -1375,6 +1378,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
if (clone_flags & CLONE_THREAD)
threadgroup_fork_read_unlock(current);
perf_event_fork(p);
+ seccomp_filter_fork(p, current);
return p;

bad_fork_free_pid:
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 57d4b13..78719be 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -47,6 +47,14 @@ void __secure_computing(int this_syscall)
return;
} while (*++syscall);
break;
+#ifdef CONFIG_SECCOMP_FILTER
+ case 2:
+ if (seccomp_test_filters(this_syscall) == 0)
+ return;
+
+ seccomp_filter_log_failure(this_syscall);
+ break;
+#endif
default:
BUG();
}
diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
new file mode 100644
index 0000000..4770847
--- /dev/null
+++ b/kernel/seccomp_filter.c
@@ -0,0 +1,639 @@
+/* bpf program-based system call filtering
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) 2011 The Chromium OS Authors <chromiu...@chromium.org>
+ */
+
+#include <linux/capability.h>
+#include <linux/compat.h>
+#include <linux/err.h>
+#include <linux/errno.h>
+#include <linux/rculist.h>
+#include <linux/filter.h>
+#include <linux/kallsyms.h>
+#include <linux/kref.h>
+#include <linux/module.h>
+#include <linux/pid.h>
+#include <linux/prctl.h>
+#include <linux/ptrace.h>
+#include <linux/ratelimit.h>
+#include <linux/reciprocal_div.h>
+#include <linux/regset.h>
+#include <linux/seccomp.h>
+#include <linux/security.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/user.h>
+
+
+/**
+ * struct seccomp_filter - container for seccomp BPF programs
+ *
+ * @usage: reference count to manage the object lifetime.
+ * get/put helpers should be used when accessing an instance
+ * outside of a lifetime-guarded section. In general, this
+ * is only needed for handling filters shared across tasks.
+ * @creator: pointer to the pid that created this filter
+ * @parent: pointer to the ancestor which this filter will be composed with.
+ * @flags: provide information about filter from creation time.
+ * @personality: personality of the process at filter creation time.
+ * @insns: the BPF program instructions to evaluate
+ * @count: the number of instructions in the program.
+ *
+ * seccomp_filter objects should never be modified after being attached
+ * to a task_struct (other than @usage).
+ */
+struct seccomp_filter {
+ struct kref usage;
+ struct pid *creator;
+ struct seccomp_filter *parent;
+ struct {
+ uint32_t admin:1, /* can allow execve */
+ compat:1, /* CONFIG_COMPAT */
+ __reserved:30;
+ } flags;
+ int personality;
+ unsigned short count; /* Instruction count */
+ struct sock_filter insns[0];
+};
+
+static unsigned int seccomp_run_filter(const u8 *buf,
+ const size_t buflen,
+ const struct sock_filter *);
+
+/**
+ * seccomp_filter_alloc - allocates a new filter object
+ * @padding: size of the insns[0] array in bytes
+ *
+ * The @padding should be a multiple of
+ * sizeof(struct sock_filter).
+ *
+ * Returns ERR_PTR on error or an allocated object.
+ */
+static struct seccomp_filter *seccomp_filter_alloc(unsigned long padding)
+{
+ struct seccomp_filter *f;
+ unsigned long bpf_blocks = padding / sizeof(struct sock_filter);
+
+ /* Drop oversized requests. */
+ if (bpf_blocks == 0 || bpf_blocks > BPF_MAXINSNS)
+ return ERR_PTR(-EINVAL);
+
+ /* Padding should always be in sock_filter increments. */
+ BUG_ON(padding % sizeof(struct sock_filter));
+
+ f = kzalloc(sizeof(struct seccomp_filter) + padding, GFP_KERNEL);
+ if (!f)
+ return ERR_PTR(-ENOMEM);
+ kref_init(&f->usage);
+ f->creator = get_task_pid(current, PIDTYPE_PID);
+ f->count = bpf_blocks;
+ return f;
+}
+
+/**
+ * seccomp_filter_free - frees the allocated filter.
+ * @filter: NULL or live object to be completely destructed.
+ */
+static void seccomp_filter_free(struct seccomp_filter *filter)
+{
+ if (!filter)
+ return;
+ put_seccomp_filter(filter->parent);
+ put_pid(filter->creator);
+ kfree(filter);
+}
+
+static void __put_seccomp_filter(struct kref *kref)
+{
+ struct seccomp_filter *orig =
+ container_of(kref, struct seccomp_filter, usage);
+ seccomp_filter_free(orig);
+}
+
+void seccomp_filter_log_failure(int syscall)
+{
+ pr_info("%s[%d]: system call %d blocked at 0x%lx\n",
+ current->comm, task_pid_nr(current), syscall,
+ KSTK_EIP(current));
+}
+
+/* put_seccomp_filter - decrements the ref count of @orig and may free. */
+void put_seccomp_filter(struct seccomp_filter *orig)
+{
+ if (!orig)
+ return;
+ kref_put(&orig->usage, __put_seccomp_filter);
+}
+
+/* get_seccomp_filter - increments the reference count of @orig. */
+struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *orig)
+{
+ if (!orig)
+ return NULL;
+ kref_get(&orig->usage);
+ return orig;
+}
+
+static int seccomp_check_personality(struct seccomp_filter *filter)
+{
+ if (filter->personality != current->personality)
+ return -EACCES;
+#ifdef CONFIG_COMPAT
+ if (filter->flags.compat != (!!(is_compat_task())))
+ return -EACCES;
+#endif
+ return 0;
+}
+
+static const struct user_regset *
+find_prstatus(const struct user_regset_view *view)
+{
+ const struct user_regset *regset;
+ int n;
+
+ /* Skip 0. */
+ for (n = 1; n < view->n; ++n) {
+ regset = view->regsets + n;
+ if (regset->core_note_type == NT_PRSTATUS)
+ return regset;
+ }
+
+ return NULL;
+}
+
+/**
+ * seccomp_get_regs - returns a pointer to struct user_regs_struct
+ * @scratch: preallocated storage of size @available
+ * @available: pointer to the size of scratch.
+ *
+ * Returns NULL if the registers cannot be acquired or copied.
+ * Returns a populated pointer to @scratch by default.
+ * Otherwise, returns a pointer to a a u8 array containing the struct
+ * user_regs_struct appropriate for the task personality. The pointer
+ * may be to the beginning of @scratch or to an externally managed data
+ * structure. On success, @available should be updated with the
+ * valid region size of the returned pointer.
+ *
+ * If the architecture overrides the linkage, then the pointer may pointer to
+ * another location.
+ */
+__weak u8 *seccomp_get_regs(u8 *scratch, size_t *available)
+{
+ /* regset is usually returned based on task personality, not current
+ * system call convention. This behavior makes it unsafe to execute
+ * BPF programs over regviews if is_compat_task or the personality
+ * have changed since the program was installed.
+ */
+ const struct user_regset_view *view = task_user_regset_view(current);
+ const struct user_regset *regset = &view->regsets[0];
+ size_t scratch_size = *available;
+ if (regset->core_note_type != NT_PRSTATUS) {
+ /* The architecture should override this method for speed. */
+ regset = find_prstatus(view);
+ if (!regset)
+ return NULL;
+ }
+ *available = regset->n * regset->size;
+ /* Make sure the scratch space isn't exceeded. */
+ if (*available > scratch_size)
+ *available = scratch_size;
+ if (regset->get(current, regset, 0, *available, scratch, NULL))
+ return NULL;
+ return scratch;
+}
+
+/**
+ * seccomp_test_filters - tests 'current' against the given syscall
+ * @syscall: number of the system call to test
+ *
+ * Returns 0 on ok and non-zero on error/failure.
+ */
+int seccomp_test_filters(int syscall)
+{
+ struct seccomp_filter *filter;
+ u8 regs_tmp[sizeof(struct user_regs_struct)], *regs;
+ size_t regs_size = sizeof(struct user_regs_struct);
+ int ret = -EACCES;
+
+ filter = current->seccomp.filter; /* uses task ref */
+ if (!filter)
+ goto out;
+
+ /* All filters in the list are required to share the same system call
+ * convention so only the first filter is ever checked.
+ */
+ if (seccomp_check_personality(filter))
+ goto out;
+
+ /* Grab the user_regs_struct. Normally, regs == &regs_tmp, but
+ * that is not mandatory. E.g., it may return a point to
+ * task_pt_regs(current). NULL checking is mandatory.
+ */
+ regs = seccomp_get_regs(regs_tmp, &regs_size);
+ if (!regs)
+ goto out;
+
+ /* Only allow a system call if it is allowed in all ancestors. */
+ ret = 0;
+ for ( ; filter != NULL; filter = filter->parent) {
+ /* Allowed if return value is the size of the data supplied. */
+ if (seccomp_run_filter(regs, regs_size, filter->insns) !=
+ regs_size)
+ ret = -EACCES;
+ }
+out:
+ return ret;
+}
+
+/**
+ * seccomp_attach_filter: Attaches a seccomp filter to current.
+ * @fprog: BPF program to install
+ *
+ * Context: User context only. This function may sleep on allocation and
+ * operates on current. current must be attempting a system call
+ * when this is called (usually prctl).
+ *
+ * This function may be called repeatedly to install additional filters.
+ * Every filter successfully installed will be evaluated (in reverse order)
+ * for each system call the thread makes.
+ *
+ * Returns 0 on success or an errno on failure.
+ */
+long seccomp_attach_filter(struct sock_fprog *fprog)
+{
+ struct seccomp_filter *filter = NULL;
+ /* Note, len is a short so overflow should be impossible. */
+ unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
+ long ret = -EPERM;
+
+ /* Allocate a new seccomp_filter */
+ filter = seccomp_filter_alloc(fp_size);
+ if (IS_ERR(filter)) {
+ ret = PTR_ERR(filter);
+ goto out;
+ }
+
+ /* Lock the process personality and calling convention. */
+#ifdef CONFIG_COMPAT
+ if (is_compat_task())
+ filter->flags.compat = 1;
+#endif
+ filter->personality = current->personality;
+
+ /* Auditing is not needed since the capability wasn't requested */
+ if (security_real_capable_noaudit(current, current_user_ns(),
+ CAP_SYS_ADMIN) == 0)
+ filter->flags.admin = 1;
+
+ /* Copy the instructions from fprog. */
+ ret = -EFAULT;
+ if (copy_from_user(filter->insns, fprog->filter, fp_size))
+ goto out;
+
+ /* Check the fprog */
+ ret = sk_chk_filter(filter->insns, filter->count);
+ if (ret)
+ goto out;
+
+ /* If there is an existing filter, make it the parent
+ * and reuse the existing task-based ref.
+ */
+ filter->parent = current->seccomp.filter;
+
+ /* Force all filters to use one system call convention. */
+ ret = -EINVAL;
+ if (filter->parent) {
+ if (filter->parent->flags.compat != filter->flags.compat)
+ goto out;
+ if (filter->parent->personality != filter->personality)
+ goto out;
+ }
+
+ /* Double claim the new filter so we can release it below simplifying
+ * the error paths earlier.
+ */
+ ret = 0;
+ get_seccomp_filter(filter);
+ current->seccomp.filter = filter;
+ /* Engage seccomp if it wasn't. This doesn't use PR_SET_SECCOMP. */
+ if (!current->seccomp.mode) {
+ current->seccomp.mode = 2;
+ set_thread_flag(TIF_SECCOMP);
+ }
+
+out:
+ put_seccomp_filter(filter); /* for get or task, on err */
+ return ret;
+}
+
+long prctl_attach_seccomp_filter(char __user *user_filter)
+{
+ struct sock_fprog fprog;
+ long ret = -EINVAL;
+
+ ret = -EFAULT;
+ if (!user_filter)
+ goto out;
+
+ if (copy_from_user(&fprog, user_filter, sizeof(fprog)))
+ goto out;
+
+ ret = seccomp_attach_filter(&fprog);
+out:
+ return ret;
+}
+
+/**
+ * seccomp_check_exec: determines if exec is allowed for current
+ * Returns 0 if allowed.
+ */
+int seccomp_check_exec(void)
+{
+ if (current->seccomp.mode != 2)
+ return 0;
+ /* We can rely on the task refcount for the filter. */
+ if (!current->seccomp.filter)
+ return -EPERM;
+ /* The last attached filter set for the process is checked. It must
+ * have been installed with CAP_SYS_ADMIN capabilities.
+ */
+ if (current->seccomp.filter->flags.admin)
+ return 0;
+ return -EPERM;
+}
+
+/* seccomp_filter_fork: manages inheritance on fork
+ * @child: forkee
+ * @parent: forker
+ * Ensures that @child inherit a seccomp_filter iff seccomp is enabled
+ * and the set of filters is marked as 'enabled'.
+ */
+void seccomp_filter_fork(struct task_struct *child,
+ struct task_struct *parent)
+{
+ if (!parent->seccomp.mode)
+ return;
+ child->seccomp.mode = parent->seccomp.mode;
+ child->seccomp.filter = get_seccomp_filter(parent->seccomp.filter);
+}
+
+/* Returns a pointer to the BPF evaluator after checking the offset and size
+ * boundaries. The signature almost matches the signature from
+ * net/core/filter.c with the hopes of sharing code in the future.
+ */
+static const void *load_pointer(const u8 *buf, size_t buflen,
+ int offset, size_t size,
+ void *unused)
+{
+ if (offset >= buflen)
+ goto fail;
+ if (offset < 0)
+ goto fail;
+ if (size > buflen - offset)
+ goto fail;
+ return buf + offset;
+fail:
+ return NULL;
+}
+
+/**
+ * seccomp_run_filter - evaluate BPF (over user_regs_struct)
+ * @buf: buffer to execute the filter over
+ * @buflen: length of the buffer
+ * @fentry: filter to apply
+ *
+ * Decode and apply filter instructions to the buffer.
+ * Return length to keep, 0 for none. @buf is a regset we are
+ * filtering, @filter is the array of filter instructions.
+ * Because all jumps are guaranteed to be before last instruction,
+ * and last instruction guaranteed to be a RET, we dont need to check
+ * flen.
+ *
+ * See core/net/filter.c as this is nearly an exact copy.
+ * At some point, it would be nice to merge them to take advantage of
+ * optimizations (like JIT).
+ *
+ * A successful filter must return the full length of the data. Anything less
+ * will currently result in a seccomp failure. In the future, it may be
+ * possible to use that for hard filtering registers on the fly so it is
+ * ideal for consumers to return 0 on intended failure.
+ */
+static unsigned int seccomp_run_filter(const u8 *buf,
+ const size_t buflen,
+ const struct sock_filter *fentry)
+{
+ const void *ptr;
+ u32 A = 0; /* Accumulator */
+ u32 X = 0; /* Index Register */
+ u32 mem[BPF_MEMWORDS]; /* Scratch Memory Store */
+ u32 tmp;
+ int k;
+
+ /*
+ * Process array of filter instructions.
+ */
+ for (;; fentry++) {
+#if defined(CONFIG_X86_32)
+#define K (fentry->k)
+#else
+ const u32 K = fentry->k;
+#endif
+
+ switch (fentry->code) {
+ case BPF_S_ALU_ADD_X:
+ A += X;
+ continue;
+ case BPF_S_ALU_ADD_K:
+ A += K;
+ continue;
+ case BPF_S_ALU_SUB_X:
+ A -= X;
+ continue;
+ case BPF_S_ALU_SUB_K:
+ A -= K;
+ continue;
+ case BPF_S_ALU_MUL_X:
+ A *= X;
+ continue;
+ case BPF_S_ALU_MUL_K:
+ A *= K;
+ continue;
+ case BPF_S_ALU_DIV_X:
+ if (X == 0)
+ return 0;
+ A /= X;
+ continue;
+ case BPF_S_ALU_DIV_K:
+ A = reciprocal_divide(A, K);
+ continue;
+ case BPF_S_ALU_AND_X:
+ A &= X;
+ continue;
+ case BPF_S_ALU_AND_K:
+ A &= K;
+ continue;
+ case BPF_S_ALU_OR_X:
+ A |= X;
+ continue;
+ case BPF_S_ALU_OR_K:
+ A |= K;
+ continue;
+ case BPF_S_ALU_LSH_X:
+ A <<= X;
+ continue;
+ case BPF_S_ALU_LSH_K:
+ A <<= K;
+ continue;
+ case BPF_S_ALU_RSH_X:
+ A >>= X;
+ continue;
+ case BPF_S_ALU_RSH_K:
+ A >>= K;
+ continue;
+ case BPF_S_ALU_NEG:
+ A = -A;
+ continue;
+ case BPF_S_JMP_JA:
+ fentry += K;
+ continue;
+ case BPF_S_JMP_JGT_K:
+ fentry += (A > K) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JGE_K:
+ fentry += (A >= K) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JEQ_K:
+ fentry += (A == K) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JSET_K:
+ fentry += (A & K) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JGT_X:
+ fentry += (A > X) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JGE_X:
+ fentry += (A >= X) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JEQ_X:
+ fentry += (A == X) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JSET_X:
+ fentry += (A & X) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_LD_W_ABS:
+ k = K;
+load_w:
+ ptr = load_pointer(buf, buflen, k, 4, &tmp);
+ if (ptr != NULL) {
+ /* Note, unlike on network data, values are not
+ * byte swapped.
+ */
+ A = *(const u32 *)ptr;
+ continue;
+ }
+ return 0;
+ case BPF_S_LD_H_ABS:
+ k = K;
+load_h:
+ ptr = load_pointer(buf, buflen, k, 2, &tmp);
+ if (ptr != NULL) {
+ A = *(const u16 *)ptr;
+ continue;
+ }
+ return 0;
+ case BPF_S_LD_B_ABS:
+ k = K;
+load_b:
+ ptr = load_pointer(buf, buflen, k, 1, &tmp);
+ if (ptr != NULL) {
+ A = *(const u8 *)ptr;
+ continue;
+ }
+ return 0;
+ case BPF_S_LD_W_LEN:
+ A = buflen;
+ continue;
+ case BPF_S_LDX_W_LEN:
+ X = buflen;
+ continue;
+ case BPF_S_LD_W_IND:
+ k = X + K;
+ goto load_w;
+ case BPF_S_LD_H_IND:
+ k = X + K;
+ goto load_h;
+ case BPF_S_LD_B_IND:
+ k = X + K;
+ goto load_b;
+ case BPF_S_LDX_B_MSH:
+ ptr = load_pointer(buf, buflen, K, 1, &tmp);
+ if (ptr != NULL) {
+ X = (*(u8 *)ptr & 0xf) << 2;
+ continue;
+ }
+ return 0;
+ case BPF_S_LD_IMM:
+ A = K;
+ continue;
+ case BPF_S_LDX_IMM:
+ X = K;
+ continue;
+ case BPF_S_LD_MEM:
+ A = mem[K];
+ continue;
+ case BPF_S_LDX_MEM:
+ X = mem[K];
+ continue;
+ case BPF_S_MISC_TAX:
+ X = A;
+ continue;
+ case BPF_S_MISC_TXA:
+ A = X;
+ continue;
+ case BPF_S_RET_K:
+ return K;
+ case BPF_S_RET_A:
+ return A;
+ case BPF_S_ST:
+ mem[K] = A;
+ continue;
+ case BPF_S_STX:
+ mem[K] = X;
+ continue;
+ case BPF_S_ANC_PROTOCOL:
+ case BPF_S_ANC_PKTTYPE:
+ case BPF_S_ANC_IFINDEX:
+ case BPF_S_ANC_MARK:
+ case BPF_S_ANC_QUEUE:
+ case BPF_S_ANC_HATYPE:
+ case BPF_S_ANC_RXHASH:
+ case BPF_S_ANC_CPU:
+ case BPF_S_ANC_NLATTR:
+ case BPF_S_ANC_NLATTR_NEST:
+ /* ignored */
+ continue;
+ default:
+ WARN_RATELIMIT(1, "Unknown code:%u jt:%u tf:%u k:%u\n",
+ fentry->code, fentry->jt,
+ fentry->jf, fentry->k);
+ return 0;
+ }
+ }
+
+ return 0;
+}
diff --git a/kernel/sys.c b/kernel/sys.c
index 481611f..77f2eda 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1783,6 +1783,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_SET_SECCOMP:
error = prctl_set_seccomp(arg2);
break;
+ case PR_ATTACH_SECCOMP_FILTER:
+ error = prctl_attach_seccomp_filter((char __user *)
+ arg2);
+ break;
case PR_GET_TSC:
error = GET_TSC_CTL(arg2);
break;
diff --git a/security/Kconfig b/security/Kconfig
index 51bd5a0..77b1106 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -84,6 +84,18 @@ config SECURITY_DMESG_RESTRICT

If you are unsure how to answer this question, answer N.

+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ select SECCOMP
+ depends on EXPERIMENTAL
+ help
+ This kernel feature expands CONFIG_SECCOMP to allow computing
+ in environments with reduced kernel access dictated by a system
+ call filter, expressed in BPF, installed by the application itself
+ through prctl(2).
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config SECURITY
bool "Enable different security models"
depends on SYSFS

Will Drewry

unread,

Jan 11, 2012, 12:30:05 PM1/11/12

to

Document how system call filtering with BPF works
and can be used.

Signed-off-by: Will Drewry <w...@chromium.org>
---

Documentation/prctl/seccomp_filter.txt | 159 ++++++++++++++++++++++++++++++++
1 files changed, 159 insertions(+), 0 deletions(-)
create mode 100644 Documentation/prctl/seccomp_filter.txt

diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
new file mode 100644
index 0000000..5fb3f44
--- /dev/null
+++ b/Documentation/prctl/seccomp_filter.txt
@@ -0,0 +1,159 @@
+ Seccomp filtering
+ =================
+
+Introduction
+------------
+
+A large number of system calls are exposed to every userland process
+with many of them going unused for the entire lifetime of the process.
+As system calls change and mature, bugs are found and eradicated. A
+certain subset of userland applications benefit by having a reduced set
+of available system calls. The resulting set reduces the total kernel
+surface exposed to the application. System call filtering is meant for
+use with those applications.
+
+Seccomp filtering provides a means for a process to specify a filter
+for incoming system calls. The filter is expressed as a Berkeley Packet
+Filter program, as with socket filters, except that the data operated on
+is the current user_regs_struct. This allows for expressive filtering
+of system calls using the pre-existing system call ABI and using a filter
+program language with a long history of being exposed to userland.
+Additionally, BPF makes it impossible for users of seccomp to fall prey to
+time-of-check-time-of-use (TOCTOU) attacks that are common in system call
+interposition frameworks because the evaluated data is solely register state
+just after system call entry.
+
+What it isn't
+-------------
+
+System call filtering isn't a sandbox. It provides a clearly defined
+mechanism for minimizing the exposed kernel surface. Beyond that,
+policy for logical behavior and information flow should be managed with
+a combinations of other system hardening techniques and, potentially, a
+LSM of your choosing. Expressive, dynamic filters provide further options down
+this path (avoiding pathological sizes or selecting which of the multiplexed
+system calls in socketcall() is allowed, for instance) which could be
+construed, incorrectly, as a more complete sandboxing solution.
+
+Usage
+-----
+
+An additional seccomp mode is added, but they are not directly set by the
+consuming process. The new mode, '2', is only available if
+CONFIG_SECCOMP_FILTER is set and enabled using prctl with the
+PR_ATTACH_SECCOMP_FILTER argument.
+
+Interacting with seccomp filters is done using one prctl(2) call.
+
+PR_ATTACH_SECCOMP_FILTER:
+ Allows the specification of a new filter using a BPF program.
+ The BPF program will be executed over a user_regs_struct data
+ reflecting system call time except with the system call number
+ resident in orig_[register]. To allow a system call, the size
+ of the data must be returned. At present, all other return values
+ result in the system call being blocked, but it is recommended to
+ return 0 in those cases. This will allow for future custom return
+ values to be introduced, if ever desired.
+
+ Usage:
+ prctl(PR_ATTACH_SECCOMP_FILTER, prog);
+
+ The 'prog' argument is a pointer to a struct sock_fprog which will
+ contain the filter program. If the program is invalid, the call
+ will return -1 and set errno to -EINVAL.
+
+ The struct user_regs_struct the @prog will see is based on the
+ personality of the task at the time of this prctl call. Additionally,
+ is_compat_task is also tracked for the @prog. This means that once set
+ the calling task will have all of its system calls blocked if it
+ switches its system call ABI (via personality or other means).
+
+ If the @prog is installed while the task has CAP_SYS_ADMIN in its user
+ namespace, the @prog will be marked as inheritable across execve. Any
+ inherited filters are still subject to the system call ABI constraints
+ above and any ABI mismatched system calls will result in process death.
+
+All of the above calls return 0 on success and non-zero on error.
+
+
+Example
+-------
+
+Assume a process would like to cleanly read and write to stdin/out/err and exit
+cleanly. Without using a BPF compiler, it may be done as follows on x86 32-bit:
+
+#include <asm/unistd.h>
+#include <linux/filter.h>
+#include <stdio.h>
+#include <stddef.h>
+#include <sys/user.h>
+#include <unistd.h>
+
+#define regoffset(_reg) (offsetof(struct user_regs_struct, _reg))
+int install_filter(void)
+{
+ struct sock_filter filter[] = {
+ /* Grab the system call number */
+ BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(orig_eax)),
+ /* Jump table for the allowed syscalls */
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_rt_sigreturn, 10, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_sigreturn, 9, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit_group, 8, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit, 7, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_read, 1, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_write, 2, 6),
+
+ /* Check that read is only using stdin. */
+ BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(ebx)),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDIN_FILENO, 3, 4),
+
+ /* Check that write is only using stdout/stderr */
+ BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(ebx)),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDOUT_FILENO, 1, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDERR_FILENO, 0, 1),
+
+ /* Put the "accept" value in A */
+ BPF_STMT(BPF_LD+BPF_W+BPF_LEN, 0),
+
+ BPF_STMT(BPF_RET+BPF_A,0),
+ };
+ struct sock_fprog prog = {
+ .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+ .filter = filter,
+ };
+ if (prctl(36, &prog)) {
+ perror("prctl");
+ return 1;
+ }
+ return 0;
+}
+
+#define payload(_c) _c, sizeof(_c)
+int main(int argc, char **argv) {
+ char buf[4096];
+ ssize_t bytes = 0;
+ if (install_filter())
+ return 1;
+ syscall(__NR_write, STDOUT_FILENO, payload("OHAI! WHAT IS YOUR NAME? "));
+ bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf));
+ syscall(__NR_write, STDOUT_FILENO, payload("HELLO, "));
+ syscall(__NR_write, STDOUT_FILENO, buf, bytes);

+ return 0;
+}
+

+Additionally, if prctl(2) is allowed by the installed filter, additional
+filters may be layered on which will increase evaluation time, but allow for
+further decreasing the attack surface during execution of a process.
+
+
+Caveats
+-------
+
+- execve will fail unless the most recently attached filter was installed by
+ a process with CAP_SYS_ADMIN (in its namespace).
+
+Adding architecture support
+-----------------------
+
+Any platform with seccomp support will support seccomp filters
+as long as CONFIG_SECCOMP_FILTER is enabled.

Jonathan Corbet

unread,

Jan 11, 2012, 3:10:01 PM1/11/12

to

Interesting approach to the problem, I think I like it. Watch for news at
11...:)

One nit:

> +Example
> +-------
> +
> +Assume a process would like to cleanly read and write to stdin/out/err and exit
> +cleanly. Without using a BPF compiler, it may be done as follows on x86 32-bit:
> +

It seems like this little program belongs in the samples/ directory.

Thanks,

jon

Will Drewry

unread,

Jan 11, 2012, 3:20:02 PM1/11/12

to

On Wed, Jan 11, 2012 at 2:03 PM, Jonathan Corbet <cor...@lwn.net> wrote:
> Interesting approach to the problem, I think I like it. Watch for news at
> 11...:)

Thanks - I'm glad to hear it!

> One nit:
>
>> +Example
>> +-------
>> +
>> +Assume a process would like to cleanly read and write to stdin/out/err and exit
>> +cleanly. Without using a BPF compiler, it may be done as follows on x86 32-bit:
>> +
>
> It seems like this little program belongs in the samples/ directory.

Cool - I'll do that and rev this patch.

cheers!
will

Will Drewry

unread,

Jan 11, 2012, 6:30:02 PM1/11/12

to

Document how system call filtering with BPF works and

may be used. Includes an example for x86 (32-bit).

Signed-off-by: Will Drewry <w...@chromium.org>
---

Documentation/prctl/seccomp_filter.txt | 99 ++++++++++++++++++++++++++++++++
samples/Makefile | 2 +-
samples/seccomp/Makefile | 12 ++++
samples/seccomp/bpf-example.c | 74 ++++++++++++++++++++++++
4 files changed, 186 insertions(+), 1 deletions(-)
create mode 100644 Documentation/prctl/seccomp_filter.txt
create mode 100644 samples/seccomp/Makefile
create mode 100644 samples/seccomp/bpf-example.c

diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
new file mode 100644

index 0000000..15d4645
--- /dev/null
+++ b/Documentation/prctl/seccomp_filter.txt
@@ -0,0 +1,99 @@

+ Additionally, if prctl(2) is allowed by the attached filter,
+ additional filters may be layered on which will increase evaluation
+ time, but allow for further decreasing the attack surface during
+ execution of a process.
+
+The above call returns 0 on success and non-zero on error.
+
+Example
+-------
+
+samples/seccomp-bpf-example.c shows an example process that allows read from stdin,
+write to stdout/err, exit and signal returns for 32-bit x86.

+
+Caveats
+-------
+
+- execve will fail unless the most recently attached filter was installed by
+ a process with CAP_SYS_ADMIN (in its namespace).
+
+Adding architecture support
+-----------------------
+
+Any platform with seccomp support will support seccomp filters
+as long as CONFIG_SECCOMP_FILTER is enabled.

diff --git a/samples/Makefile b/samples/Makefile
index 6280817..f29b19c 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -1,4 +1,4 @@
# Makefile for Linux samples code

obj-$(CONFIG_SAMPLES) += kobject/ kprobes/ tracepoints/ trace_events/ \
- hw_breakpoint/ kfifo/ kdb/ hidraw/
+ hw_breakpoint/ kfifo/ kdb/ hidraw/ seccomp/
diff --git a/samples/seccomp/Makefile b/samples/seccomp/Makefile
new file mode 100644
index 0000000..80dc8e4
--- /dev/null
+++ b/samples/seccomp/Makefile
@@ -0,0 +1,12 @@
+# kbuild trick to avoid linker error. Can be omitted if a module is built.
+obj- := dummy.o
+
+# List of programs to build
+hostprogs-$(CONFIG_X86_32) := bpf-example
+bpf-example-objs := bpf-example.o
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
+
+HOSTCFLAGS_bpf-example.o += -I$(objtree)/usr/include -m32
+HOSTLOADLIBES_bpf-example += -m32
diff --git a/samples/seccomp/bpf-example.c b/samples/seccomp/bpf-example.c
new file mode 100644
index 0000000..f98b70a
--- /dev/null
+++ b/samples/seccomp/bpf-example.c
@@ -0,0 +1,74 @@
+/*
+ * Seccomp BPF example
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <chromiu...@chromium.org>
+ * Author: Will Drewry <w...@chromium.org>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
+ */

+
+#include <asm/unistd.h>
+#include <linux/filter.h>
+#include <stdio.h>
+#include <stddef.h>

+#include <sys/prctl.h>

+#include <sys/user.h>
+#include <unistd.h>
+

+#ifndef PR_ATTACH_SECCOMP_FILTER
+# define PR_ATTACH_SECCOMP_FILTER 36
+#endif

+
+#define regoffset(_reg) (offsetof(struct user_regs_struct, _reg))

+static int install_filter(void)

+ if (prctl(PR_ATTACH_SECCOMP_FILTER, &prog)) {

+ perror("prctl");
+ return 1;
+ }
+ return 0;
+}
+
+#define payload(_c) _c, sizeof(_c)
+int main(int argc, char **argv) {
+ char buf[4096];
+ ssize_t bytes = 0;
+ if (install_filter())
+ return 1;
+ syscall(__NR_write, STDOUT_FILENO, payload("OHAI! WHAT IS YOUR NAME? "));
+ bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf));
+ syscall(__NR_write, STDOUT_FILENO, payload("HELLO, "));
+ syscall(__NR_write, STDOUT_FILENO, buf, bytes);
+ return 0;
+}

--
1.7.5.4

Will Drewry

unread,

Jan 11, 2012, 7:30:02 PM1/11/12

to

Hrm, I may need to guard sample compilation based on host arch and not
just target arch. Documentation v3 will be on the way once I have that
behaving properly. :/

Sorry!
will

Serge Hallyn

unread,

Jan 12, 2012, 4:00:02 AM1/12/12

to

Hey Will,

A few comments below, but otherwise

Acked-by: Serge Hallyn <serge....@canonical.com>

thanks,
-serge

I still think the BUG_ON here is harsh given that the progsize is passed
in by userspace. Was there a reason not to return -EINVAL here?

This comment is confusing. By 'It must' you mean that if not, it's
denied. But if I didn't know better I would read that as "we can't
get to this code unless". Can you change it to something like
"Exec is refused unless the filter was installed with CAP_SYS_ADMIN
privilege"?

Łukasz Sowa

unread,

Jan 12, 2012, 8:20:01 AM1/12/12

to

Hi Will,

That's very different approach to the system call interposition problem.
I find you solution very interesting. It gives far more capabilities
than my syscalls cgroup that you commented on some time ago. It's ready
now but I haven't tried filtering yet. I think that if your solution
make it to the mainline (and I guess that's really possible at current
stage :)), there will be no place for mine solution but that's ok.

There's one thing that I'm curious about - have you measured overhead in
any way? That was one of the biggest issues in all previous attempts to
limit syscalls. I'd love to compare the numbers with mine solution.

I'll examine your patch later on and put some comments if I bump into
something.

Best Regards,
Lukasz Sowa

Oleg Nesterov

unread,

Jan 12, 2012, 10:00:02 AM1/12/12

to

On 01/11, Will Drewry wrote:
>
> This patch adds support for seccomp mode 2. This mode enables dynamic
> enforcement of system call filtering policy in the kernel as specified
> by a userland task. The policy is expressed in terms of a BPF program,
> as is used for userland-exposed socket filtering. Instead of network
> data, the BPF program is evaluated over struct user_regs_struct at the
> time of the system call (as retrieved using regviews).

Cool ;)

I didn't really read this patch yet, just one nit.

> +#define seccomp_filter_init_task(_tsk) do { \
> + (_tsk)->seccomp.filter = NULL; \
> +} while (0);

Cosmetic and subjective, but imho it would be better to add inline
functions instead of define's.

> @@ -166,6 +167,7 @@ void free_task(struct task_struct *tsk)
> free_thread_info(tsk->stack);
> rt_mutex_debug_task_free(tsk);
> ftrace_graph_exit_task(tsk);
> + seccomp_filter_free_task(tsk);
> free_task_struct(tsk);
> }
> EXPORT_SYMBOL(free_task);
> @@ -1209,6 +1211,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> /* Perform scheduler related setup. Assign this task to a CPU. */
> sched_fork(p);
>
> + seccomp_filter_init_task(p);

This doesn't look right or I missed something. something seccomp_filter_init_task()
should be called right after dup_task_struct(), at least before copy process can
fail.

Otherwise copy_process()->free_fork()->seccomp_filter_free_task() can put
current->seccomp.filter copied by arch_dup_task_struct().

> +struct seccomp_filter {
> + struct kref usage;
> + struct pid *creator;

Why? seccomp_filter->creator is never used, no?

Oleg.

Steven Rostedt

unread,

Jan 12, 2012, 10:50:02 AM1/12/12

to

On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:

> Filter programs may _only_ cross the execve(2) barrier if last filter
> program was attached by a task with CAP_SYS_ADMIN capabilities in its
> user namespace. Once a task-local filter program is attached from a
> process without privileges, execve will fail. This ensures that only
> privileged parent task can affect its privileged children (e.g., setuid
> binary).

This means that a non privileged user can not run another program with
limited features? How would a process exec another program and filter
it? I would assume that the filter would need to be attached first and
then the execv() would be performed. But after the filter is attached,
the execv is prevented?

Maybe I don't understand this correctly.

-- Steve

Andrew Lutomirski

unread,

Jan 12, 2012, 11:20:02 AM1/12/12

to

On Thu, Jan 12, 2012 at 7:43 AM, Steven Rostedt <ros...@goodmis.org> wrote:
> On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
>
>> Filter programs may _only_ cross the execve(2) barrier if last filter
>> program was attached by a task with CAP_SYS_ADMIN capabilities in its
>> user namespace. Once a task-local filter program is attached from a
>> process without privileges, execve will fail. This ensures that only
>> privileged parent task can affect its privileged children (e.g., setuid
>> binary).
>
> This means that a non privileged user can not run another program with
> limited features? How would a process exec another program and filter
> it? I would assume that the filter would need to be attached first and
> then the execv() would be performed. But after the filter is attached,
> the execv is prevented?
>
> Maybe I don't understand this correctly.

Time to resurrect execve_nosecurity? If so, then the rule could be
simplified to: seccomp programs cannot use normal execve at all.

The longer I linger on lists and see neat ideas like this, the more I
get annoyed that execve is magical. I dream of a distribution that
doesn't use setuid, file capabilities, selinux transitions on exec, or
any other privilege changes on exec *at all*. I think that the only
things missing in the kernel (other than something intelligent to do
about SELinux) are execve_nosecurity and the ability for a normal
program to wait for an unrelated program to finish (or some other way
that a program can ask a daemon to spawn a privileged program for it
and then to cleanly wait for that program to finish in a way that
could survive re-exec of the daemon).

--Andy

Oleg Nesterov

unread,

Jan 12, 2012, 11:30:02 AM1/12/12

to

On 01/11, Will Drewry wrote:
>

Stupid question. I am sure you know what are you doing ;) and I know
nothing about !x86 arches.

But could you explain why it is designed to use user_regs_struct ?
Why we can't simply use task_pt_regs() and avoid the (costly) regsets?

Just curious.

Oleg.

Oleg Nesterov

unread,

Jan 12, 2012, 11:30:02 AM1/12/12

to

On 01/12, Steven Rostedt wrote:
>
> On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
>
> > Filter programs may _only_ cross the execve(2) barrier if last filter
> > program was attached by a task with CAP_SYS_ADMIN capabilities in its
> > user namespace. Once a task-local filter program is attached from a
> > process without privileges, execve will fail. This ensures that only
> > privileged parent task can affect its privileged children (e.g., setuid
> > binary).
>
> This means that a non privileged user can not run another program with
> limited features? How would a process exec another program and filter
> it? I would assume that the filter would need to be attached first and
> then the execv() would be performed. But after the filter is attached,
> the execv is prevented?
>
> Maybe I don't understand this correctly.

May be this needs something like LSM_UNSAFE_SECCOMP, or perhaps
cap_bprm_set_creds() should take seccomp.mode == 2 into account, I dunno.

OTOH, currently seccomp.mode == 1 doesn't allow to exec at all.

Oleg.

Alan Cox

unread,

Jan 12, 2012, 11:30:02 AM1/12/12

to

> Filter programs may _only_ cross the execve(2) barrier if last filter
> program was attached by a task with CAP_SYS_ADMIN capabilities in its
> user namespace. Once a task-local filter program is attached from a
> process without privileges, execve will fail. This ensures that only
> privileged parent task can affect its privileged children (e.g., setuid
> binary).

I think this model is wrong. The rest of the policy rules all work on the
basis that dumpable is the decider (the same rules for not dumping, not
tracing, etc). A user should be able to apply filter to their own code
arbitarily. Any setuid app should IMHO lose the trace subject to the usual
uid rules and capability rules. That would seem to be more flexible and
also the path of least surprise.

[plus you can implement non setuid exec entirely in userspace so it's
a rather meaningless distinction you propose]

> be tackled separately via separate patchsets. (And at some point sharing
> BPF JIT code!)

A BPF jit ought to be trivial and would be a big win.

In general I like this approach. It's simple, it's compact and it offers
interesting possibilities for solving some interesting problem spaces,
without the full weight of SELinux, SMACK etc which are still needed for
heavyweight security.

Alan

Steven Rostedt

unread,

Jan 12, 2012, 11:30:02 AM1/12/12

to

On Thu, 2012-01-12 at 08:14 -0800, Andrew Lutomirski wrote:

> The longer I linger on lists and see neat ideas like this, the more I
> get annoyed that execve is magical. I dream of a distribution that
> doesn't use setuid, file capabilities, selinux transitions on exec, or
> any other privilege changes on exec *at all*.

Is that the fear with filtering on execv? That if we have filters on an
execv calling a setuid program that we change the behavior of that
privileged program and might cause unexpected results?

In that case, just have execv fail if filtering is enabled and we are
execing a setuid program. But I don't see why non "magical" execv's
should be prohibited.

Steven Rostedt

unread,

Jan 12, 2012, 11:40:02 AM1/12/12

to

On Thu, 2012-01-12 at 17:14 +0100, Oleg Nesterov wrote:

> May be this needs something like LSM_UNSAFE_SECCOMP, or perhaps
> cap_bprm_set_creds() should take seccomp.mode == 2 into account, I dunno.
>
> OTOH, currently seccomp.mode == 1 doesn't allow to exec at all.

I've never used seccomp, so I admit I'm totally ignorant on this topic.

But looking at seccomp from the outside, the biggest advantage to this
would be the ability for normal processes to be able to limit tasks it
kicks off. If I want to run a task in a sandbox, I don't want to be root
to do so.

I guess a web browser doesn't perform an exec to run java programs. But
it would be nice if I could execute something from the command line that
I could run in a sand box.

What's the problem with making sure that the setuid isn't set before
doing an execv? Only fail when setuid (or some other magic) is enabled
on the file being exec'd.

Or is this a race where I can have a soft link pointing to a normal
file, run this, and have the link change to a setuid file at just the
right time that causes it to happen?

-- Steve

Oleg Nesterov

unread,

Jan 12, 2012, 12:00:02 PM1/12/12

to

On 01/12, Steven Rostedt wrote:
>

> On Thu, 2012-01-12 at 17:14 +0100, Oleg Nesterov wrote:
>
> > May be this needs something like LSM_UNSAFE_SECCOMP, or perhaps
> > cap_bprm_set_creds() should take seccomp.mode == 2 into account, I dunno.
> >
> > OTOH, currently seccomp.mode == 1 doesn't allow to exec at all.
>
> I've never used seccomp, so I admit I'm totally ignorant on this topic.

me too ;)

> But looking at seccomp from the outside, the biggest advantage to this
> would be the ability for normal processes to be able to limit tasks it
> kicks off. If I want to run a task in a sandbox, I don't want to be root
> to do so.
>
> I guess a web browser doesn't perform an exec to run java programs. But
> it would be nice if I could execute something from the command line that
> I could run in a sand box.
>
> What's the problem with making sure that the setuid isn't set before
> doing an execv? Only fail when setuid (or some other magic) is enabled
> on the file being exec'd.

I agree. That is why I mentioned LSM_UNSAFE_SECCOMP/cap_bprm_set_creds.
Just I do not know what would be the most simple/clean way to do this.

And in any case I agree that the current seccomp_check_exec() looks
strange. Btw, it does
{

if (current->seccomp.mode != 2)

return 0;

/* We can rely on the task refcount for the filter. */

if (!current->seccomp.filter)
return -EPERM;

How it is possible to have seccomp.filter == NULL with mode == 2?

Oleg.

Will Drewry

unread,

Jan 12, 2012, 12:00:02 PM1/12/12

to

On Thu, Jan 12, 2012 at 9:43 AM, Steven Rostedt <ros...@goodmis.org> wrote:
> On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
>
>> Filter programs may _only_ cross the execve(2) barrier if last filter
>> program was attached by a task with CAP_SYS_ADMIN capabilities in its
>> user namespace. Once a task-local filter program is attached from a
>> process without privileges, execve will fail. This ensures that only
>> privileged parent task can affect its privileged children (e.g., setuid
>> binary).
>
> This means that a non privileged user can not run another program with
> limited features? How would a process exec another program and filter
> it? I would assume that the filter would need to be attached first and
> then the execv() would be performed. But after the filter is attached,
> the execv is prevented?

Yeah - it means tasks can filter themselves, but not each other.
However, you can inject a filter for any dynamically linked executable
using LD_PRELOAD.

> Maybe I don't understand this correctly.

You're right on. This was to ensure that one process didn't cause
crazy behavior in another. I think Alan has a better proposal than
mine below. (Goes back to catching up.)

Will Drewry

unread,

Jan 12, 2012, 12:00:01 PM1/12/12

to

Thanks! Unimportant responses below. Fixes will be incorporated in
the next round (along with Oleg's feedback).

cheers,
will

I've changed it in the next revision. As is, I don't believe
userspace can control
the size of padding directly, just the increment since it specifies
its length in terms
of bpf blocks (sizeof(struct sock_filter)). But EINVAL is certainly
less aggressive :)

Sounds good!

Will Drewry

unread,

Jan 12, 2012, 12:00:02 PM1/12/12

to

On Thu, Jan 12, 2012 at 8:50 AM, Oleg Nesterov <ol...@redhat.com> wrote:
> On 01/11, Will Drewry wrote:
>>
>> This patch adds support for seccomp mode 2. This mode enables dynamic
>> enforcement of system call filtering policy in the kernel as specified
>> by a userland task. The policy is expressed in terms of a BPF program,
>> as is used for userland-exposed socket filtering. Instead of network
>> data, the BPF program is evaluated over struct user_regs_struct at the
>> time of the system call (as retrieved using regviews).
>
> Cool ;)
>
> I didn't really read this patch yet, just one nit.
>
>> +#define seccomp_filter_init_task(_tsk) do { \
>> + (_tsk)->seccomp.filter = NULL; \
>> +} while (0);
>
> Cosmetic and subjective, but imho it would be better to add inline
> functions instead of define's.

Refactoring it a bit to make that possible. Since seccomp fork/init/free
never needs access to the whole task_structs, I'll just pass in what's
needed (and avoid the sched.h inclusion recursion).

Comments on the next round will most definitely be appreciated!

>> @@ -166,6 +167,7 @@ void free_task(struct task_struct *tsk)
>> free_thread_info(tsk->stack);
>> rt_mutex_debug_task_free(tsk);
>> ftrace_graph_exit_task(tsk);
>> + seccomp_filter_free_task(tsk);
>> free_task_struct(tsk);
>> }
>> EXPORT_SYMBOL(free_task);
>> @@ -1209,6 +1211,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>> /* Perform scheduler related setup. Assign this task to a CPU. */
>> sched_fork(p);
>>
>> + seccomp_filter_init_task(p);
>
> This doesn't look right or I missed something. something seccomp_filter_init_task()
> should be called right after dup_task_struct(), at least before copy process can
> fail.
>
> Otherwise copy_process()->free_fork()->seccomp_filter_free_task() can put
> current->seccomp.filter copied by arch_dup_task_struct().

Ah - makes sense! I moved it under dup_task_struct before any goto's
to bad_fork_free.

>> +struct seccomp_filter {
>> + struct kref usage;
>> + struct pid *creator;
>
> Why? seccomp_filter->creator is never used, no?

Removing it. It is from a related patch I'm experimenting with (adding
optional tracehook support), but it has no bearing here.

Thanks - new patch revision incoming!
will

Andrew Lutomirski

unread,

Jan 12, 2012, 12:00:03 PM1/12/12

to

On Thu, Jan 12, 2012 at 8:27 AM, Steven Rostedt <ros...@goodmis.org> wrote:
> On Thu, 2012-01-12 at 08:14 -0800, Andrew Lutomirski wrote:
>
>> The longer I linger on lists and see neat ideas like this, the more I
>> get annoyed that execve is magical. I dream of a distribution that
>> doesn't use setuid, file capabilities, selinux transitions on exec, or
>> any other privilege changes on exec *at all*.
>
> Is that the fear with filtering on execv? That if we have filters on an
> execv calling a setuid program that we change the behavior of that
> privileged program and might cause unexpected results?

Exactly.

>
> In that case, just have execv fail if filtering is enabled and we are
> execing a setuid program. But I don't see why non "magical" execv's
> should be prohibited.
>

How do you define "non-magical"?

If setuid is set, then it's obviously magical. On a nosuid
filesystem, strange things happen. If file capabilities are enabled
and set, then different magic happens. With LSMs involved, anything
can be magical. (SELinux AFAICT looks up rules on every single exec,
so it might be impossible for execve to be non-magical.) If execve is
banned entirely when seccomp is enabled, then there will never be any
attacks based on abusing these mechanisms.

My proposal is to have an alternative mechanism that, from a security
POV, does nothing that the caller couldn't have done on its own. The
only reason it would be needed at all is because implementing execve
with correct semantics from userspace is a PITA -- the right set of
fds needs to be closed, threads need to be killed (without races),
vmas need to be found an unmapped, a new program needs to be mapped in
(possibly at the same place that the old one was mapped at),
/proc/self/exe needs to be updated, auxv needs to be recreated
(including using values that glibc might have erased already), etc.

The code is short and it works (although I have no idea whether it
applies to current kernels).

Oleg: my only issue with setting something like LSM_UNSAFE_SECCOMP is
that a different class of vulnerability might be introduced: take a
setuid program that calls other setuid programs (or just uses execve
as a way to get the default execve capability handling, SELinux
handling, etc), run it (as root!) inside seccomp, and watch it
possibly develop security holes. If the alternate execve is a
different syscall, then this can't happen. And if someone remaps
execve to execve_nosecurity (from userspace or via some in-kernel
mechanism) and causes problems, it's entirely clear who to blame.

--Andy

Will Drewry

unread,

Jan 12, 2012, 12:10:02 PM1/12/12

to

On Thu, Jan 12, 2012 at 10:18 AM, Alan Cox <al...@lxorguk.ukuu.org.uk> wrote:
>> Filter programs may _only_ cross the execve(2) barrier if last filter
>> program was attached by a task with CAP_SYS_ADMIN capabilities in its
>> user namespace. Once a task-local filter program is attached from a
>> process without privileges, execve will fail. This ensures that only
>> privileged parent task can affect its privileged children (e.g., setuid
>> binary).
>
> I think this model is wrong. The rest of the policy rules all work on the
> basis that dumpable is the decider (the same rules for not dumping, not
> tracing, etc). A user should be able to apply filter to their own code
> arbitarily. Any setuid app should IMHO lose the trace subject to the usual
> uid rules and capability rules. That would seem to be more flexible and
> also the path of least surprise.

My line of thinking up to now has been that disallowing setuid exec
would mean there is no risk of an errant setuid binary allowing escape
from the system call filters (which the containers people may care
more about). Since setuid is privilege escalation, then perhaps it
makes sense to allow it as an escape hatch.

Would it be sane to just disallow setuid exec exclusively?

> [plus you can implement non setuid exec entirely in userspace so it's
> a rather meaningless distinction you propose]

Agreed.

>> be tackled separately via separate patchsets. (And at some point sharing
>> BPF JIT code!)
>
> A BPF jit ought to be trivial and would be a big win.
>
> In general I like this approach. It's simple, it's compact and it offers
> interesting possibilities for solving some interesting problem spaces,
> without the full weight of SELinux, SMACK etc which are still needed for
> heavyweight security.
>

Thanks! Yeah I think merging with the network stack is eminently
doable, but I didn't want to bog down the proposal in how much
overhead I might be adding to the network layer.

Will Drewry

unread,

Jan 12, 2012, 12:10:02 PM1/12/12

to

It shouldn't be. It's another relic I missed from development. (Adding to v3 :)

Andrew Lutomirski

unread,

Jan 12, 2012, 12:10:03 PM1/12/12

to

On Wed, Jan 11, 2012 at 9:25 AM, Will Drewry <w...@chromium.org> wrote:
> This patch adds support for seccomp mode 2. This mode enables dynamic
> enforcement of system call filtering policy in the kernel as specified
> by a userland task. The policy is expressed in terms of a BPF program,
> as is used for userland-exposed socket filtering. Instead of network
> data, the BPF program is evaluated over struct user_regs_struct at the
> time of the system call (as retrieved using regviews).
>

There's some seccomp-related code in the vsyscall emulation path in
arch/x86/kernel/vsyscall_64.c. How should time(), getcpu(), and
gettimeofday() be handled? If you want filtering to work, there
aren't any real syscall registers to inspect, but they could be
synthesized.

Preventing a malicious task from figuring out approximately what time
it is is basically impossible because of the way that vvars work. I
don't know how to change that efficiently.

--Andy

Alan Cox

unread,

Jan 12, 2012, 12:20:02 PM1/12/12

to

> more about). Since setuid is privilege escalation, then perhaps it
> makes sense to allow it as an escape hatch.
>
> Would it be sane to just disallow setuid exec exclusively?

I think that is a policy question. I can imagine cases where either
behaviour is the "right" one so it may need to be a parameter ?

Alan

Will Drewry

unread,

Jan 12, 2012, 12:20:02 PM1/12/12

to

So on x86 32, it would work since user_regs_struct == task_pt_regs
(iirc), but on x86-64
and others, that's not true. I don't think it's kosher to expose
pt_regs to the userspace, but if, let's say, x86-32 overrides the weak
linkage, then it could just return task_pt_regs and be the fastest
path.

If it would be appropriate to expose pt_regs to userspace, then I'd
happily do so :)

Steven Rostedt

unread,

Jan 12, 2012, 12:20:02 PM1/12/12

to

On Thu, 2012-01-12 at 09:09 -0800, Linus Torvalds wrote:

> The whole "fail security escalations" thing goes way beyond just
> filtering, I think we could seriously try to make it a generic
> feature.

After I wrote this comment I thought the same thing. It would be nice to
have a way to just set a flag to a process that will prevent it from
doing any escalating of privileges.

I totally agree, this would solve a whole host of issues with regard to
security issues in things that shouldn't be a problem but currently are.

Randy Dunlap

unread,

Jan 12, 2012, 12:20:02 PM1/12/12

to

On 01/11/2012 03:19 PM, Will Drewry wrote:
> Document how system call filtering with BPF works and
> may be used. Includes an example for x86 (32-bit).

Please tell some of us what "BPF" means. wikipedia lists 15 possible
choices, but I don't know which one to choose.

> Signed-off-by: Will Drewry <w...@chromium.org>
> ---
> Documentation/prctl/seccomp_filter.txt | 99 ++++++++++++++++++++++++++++++++
> samples/Makefile | 2 +-
> samples/seccomp/Makefile | 12 ++++
> samples/seccomp/bpf-example.c | 74 ++++++++++++++++++++++++
> 4 files changed, 186 insertions(+), 1 deletions(-)
> create mode 100644 Documentation/prctl/seccomp_filter.txt
> create mode 100644 samples/seccomp/Makefile
> create mode 100644 samples/seccomp/bpf-example.c

--
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

Linus Torvalds

unread,

Jan 12, 2012, 12:20:02 PM1/12/12

to

On Thu, Jan 12, 2012 at 8:27 AM, Steven Rostedt <ros...@goodmis.org> wrote:
>
> In that case, just have execv fail if filtering is enabled and we are
> execing a setuid program. But I don't see why non "magical" execv's
> should be prohibited.

The whole "fail security escalations" thing goes way beyond just
filtering, I think we could seriously try to make it a generic
feature.

For example, somebody just asked me the other day why "chroot()"
requires admin privileges, since it would be good to limit even
non-root things.

And it's really the exact same issue as filtering: in some sense,
chroot() "filters" FS name lookups, and can be used to fool programs
that are written to be secure.

We could easily introduce a per-process flag that just says "cannot
escalate privileges". Which basically just disables execve() of
suid/sgid programs (and possibly other things too), and locks the
process to the current privileges. And then make the rule be that *if*
that flag is set, you can then filter across an execve, or chroot as a
normal user, or whatever.

There are probably other things like that - things like allowing users
to do bind mounts etc - that aren't dangerous in themselves, but that
are dangerous mainly because they can be used to fool things into
privilege escalations. So this is definitely not a filter-only issue.

Linus

Will Drewry

unread,

Jan 12, 2012, 12:30:01 PM1/12/12

to

On Thu, Jan 12, 2012 at 12:16 PM, Randy Dunlap <rdu...@xenotime.net> wrote:
> On 01/11/2012 03:19 PM, Will Drewry wrote:
>> Document how system call filtering with BPF works and
>> may be used. Includes an example for x86 (32-bit).
>
> Please tell some of us what "BPF" means. wikipedia lists 15 possible
> choices, but I don't know which one to choose.

I'll make it clearer in the documentation file and update the patch description.

BPF == Berkeley Packet Filters which are implemented in Linux Socket
Filters (LSF)>

thanks!

Will Drewry

unread,

Jan 12, 2012, 12:30:01 PM1/12/12

to

On Thu, Jan 12, 2012 at 7:13 AM, Łukasz Sowa <luk...@gmail.com> wrote:
> Hi Will,
>
> That's very different approach to the system call interposition problem.
> I find you solution very interesting. It gives far more capabilities
> than my syscalls cgroup that you commented on some time ago. It's ready
> now but I haven't tried filtering yet. I think that if your solution
> make it to the mainline (and I guess that's really possible at current
> stage :)), there will be no place for mine solution but that's ok.

Yeah - there've been so many tries, I'll be happy when one makes it in
which is usable :)

> There's one thing that I'm curious about - have you measured overhead in
> any way? That was one of the biggest issues in all previous attempts to
> limit syscalls. I'd love to compare the numbers with mine solution.

Certainly. I have some rough numbers, but nothing I'd call strong
measurements. There is still a fair amount of cost due to the syscall
slow path.

> I'll examine your patch later on and put some comments if I bump into
> something.

Much appreciated - cheers!
will

Jamie Lokier

unread,

Jan 12, 2012, 12:40:01 PM1/12/12

to

Steven Rostedt wrote:
> On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
>
> > Filter programs may _only_ cross the execve(2) barrier if last filter
> > program was attached by a task with CAP_SYS_ADMIN capabilities in its
> > user namespace. Once a task-local filter program is attached from a
> > process without privileges, execve will fail. This ensures that only
> > privileged parent task can affect its privileged children (e.g., setuid
> > binary).
>
> This means that a non privileged user can not run another program with
> limited features? How would a process exec another program and filter
> it? I would assume that the filter would need to be attached first and
> then the execv() would be performed. But after the filter is attached,
> the execv is prevented?

Ugly method: Using ptrace(), trap after the execve() and issue fake
syscalls to install the filter. I feel dirty thinking it, in a good way.

LD_PRELOAD has been suggested. It's not 100% reliable because not all
executables are dynamic (on some uClinux platforms none of them are),
but it will usually work.

-- Jamie

Will Drewry

unread,

Jan 12, 2012, 12:40:01 PM1/12/12

to

On Thu, Jan 12, 2012 at 11:22 AM, Jamie Lokier <ja...@shareable.org> wrote:
> Will Drewry wrote:
>> On Thu, Jan 12, 2012 at 9:43 AM, Steven Rostedt <ros...@goodmis.org> wrote:
>> > On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
>> >
>> >> Filter programs may _only_ cross the execve(2) barrier if last filter
>> >> program was attached by a task with CAP_SYS_ADMIN capabilities in its
>> >> user namespace. Once a task-local filter program is attached from a
>> >> process without privileges, execve will fail. This ensures that only
>> >> privileged parent task can affect its privileged children (e.g., setuid
>> >> binary).
>> >
>> > This means that a non privileged user can not run another program with
>> > limited features? How would a process exec another program and filter
>> > it? I would assume that the filter would need to be attached first and
>> > then the execv() would be performed. But after the filter is attached,
>> > the execv is prevented?
>>
>> Yeah - it means tasks can filter themselves, but not each other.
>> However, you can inject a filter for any dynamically linked executable
>> using LD_PRELOAD.
>>
>> > Maybe I don't understand this correctly.
>>
>> You're right on. This was to ensure that one process didn't cause
>> crazy behavior in another. I think Alan has a better proposal than
>> mine below. (Goes back to catching up.)
>

> You can already use ptrace() to cause crazy behaviour in another
> process, including modifying registers arbitrarily at syscall entry
> and exit, aborting and emulating syscalls.
>
> ptrace() is quite slow and it would be really nice to speed it up,
> especially for trapping a small subset of syscalls, or limiting some
> kinds of access to some file descriptors, while everything else runs
> at normal speed.
>
> Speeding up ptrace() with BPF filters would be a really nice. Not
> that I like ptrace(), but sometimes it's the only thing you can rely on.
>
> LD_PRELOAD and code running in the target process address space can't
> always be trusted in some contexts (e.g. the target process may modify
> the tracing code or its data); whereas ptrace() is pretty complete and
> reliable, if ugly.
>
> There's already a security model around who can use ptrace(); speeding
> it up needn't break that.
>
> If we'd had BPF ptrace in the first place, SECCOMP wouldn't have been
> needed as userspace could have done it, with exactly the restrictions
> it wants. Google's NaCl comes to mind as a potential user.

That's not entirely true. ptrace supervisors are subject to races and
always fail open. This makes them effective but not as robust as a
seccomp solution can provide.

With seccomp, it fails close. What I think would make sense would be
to add a user-controllable failure mode with seccomp bpf that calls
tracehook_ptrace_syscall_entry(regs). I've prototyped this and it
works quite well, but I didn't want to conflate the discussions.

Using ptrace() would also mean that all consumers of this interface
would need a supervisor, but with seccomp, the filters are installed
and require no supervisors to stick around for when failure occurs.

Does that make sense?
thanks!
will

Steven Rostedt

unread,

Jan 12, 2012, 12:40:01 PM1/12/12

to

On Thu, 2012-01-12 at 11:23 -0600, Will Drewry wrote:

> > Please tell some of us what "BPF" means. wikipedia lists 15 possible
> > choices, but I don't know which one to choose.
>
> I'll make it clearer in the documentation file and update the patch description.
>
> BPF == Berkeley Packet Filters which are implemented in Linux Socket
> Filters (LSF)>
>

I admit, I was totally clueless in what it meant too ;)

Even the LWN article didn't explain (shame on you Jon).

"he has repurposed the networking layer's packet filtering mechanism
(BPF)"

I didn't know what did the "B" stood for.

-- Steve

Jamie Lokier

unread,

Jan 12, 2012, 12:40:01 PM1/12/12

to

Steven Rostedt wrote:
> On Thu, 2012-01-12 at 17:14 +0100, Oleg Nesterov wrote:
>
> > May be this needs something like LSM_UNSAFE_SECCOMP, or perhaps
> > cap_bprm_set_creds() should take seccomp.mode == 2 into account, I dunno.
> >
> > OTOH, currently seccomp.mode == 1 doesn't allow to exec at all.
>
> I've never used seccomp, so I admit I'm totally ignorant on this topic.
>
> But looking at seccomp from the outside, the biggest advantage to this
> would be the ability for normal processes to be able to limit tasks it
> kicks off. If I want to run a task in a sandbox, I don't want to be root
> to do so.
>
> I guess a web browser doesn't perform an exec to run java programs.

Actually it does. Firefox on Linux forks and execs the Java VM.
Same for Flash, using "plugin-container".

> But it would be nice if I could execute something from the command
> line that I could run in a sand box.

You can do this now, using ptrace(). It's horrible, but half of the
horribleness is needing to understand machine-dependent registers,
which this new patch doesn't address. (The other half is a ton of
undocumented but important ptrace() behaviours on Linux.)

-- Jamie

Oleg Nesterov

unread,

Jan 12, 2012, 12:40:02 PM1/12/12

to

On 01/12, Will Drewry wrote:
>
> On Thu, Jan 12, 2012 at 10:22 AM, Oleg Nesterov <ol...@redhat.com> wrote:
> >> + */
> >> + regs = seccomp_get_regs(regs_tmp, &regs_size);
> >
> > Stupid question. I am sure you know what are you doing ;) and I know
> > nothing about !x86 arches.
> >
> > But could you explain why it is designed to use user_regs_struct ?
> > Why we can't simply use task_pt_regs() and avoid the (costly) regsets?
>
> So on x86 32, it would work since user_regs_struct == task_pt_regs
> (iirc), but on x86-64
> and others, that's not true.

Yes sure, I meant that userpace should use pt_regs too.

> If it would be appropriate to expose pt_regs to userspace, then I'd
> happily do so :)

Ah, so that was the reason. But it is already exported? At least I see
the "#ifndef __KERNEL__" definition in arch/x86/include/asm/ptrace.h.

Once again, I am not arguing, just trying to understand. And I do not
know if this definition is part of abi.

Oleg.

Jamie Lokier

unread,

Jan 12, 2012, 12:50:01 PM1/12/12

to

Steven Rostedt wrote:

> On Thu, 2012-01-12 at 17:30 +0000, Jamie Lokier wrote:
>
> > You can do this now, using ptrace(). It's horrible, but half of the
> > horribleness is needing to understand machine-dependent registers,
> > which this new patch doesn't address. (The other half is a ton of
> > undocumented but important ptrace() behaviours on Linux.)
>

> Yeah I know the horrid use of ptrace, I've implemented programs that use
> it :-p

That warm fuzzy feeling :-)

> I guess ptrace can capture the execv and determine if it is OK or not to
> run it. But again, this doesn't stop the possible attacks that could
> happen, with having the execv on a symlink file, having the ptrace check
> say its OK, and then switching the symlink to a setuid file.
>
> When the new execv executed, the parent process would lose all control
> over it. The idea is to prevent this.

fexecve() exists to solve the problem.
Also known as execve("/proc/self/fd/...") on Linux.

> I like Alan's suggestion. Have userspace decide to allow execv or not,
> and even let it decide if it should allow setuid execv's or not, but
> still allow non-setuid execvs. If you allow the setuid execv, once that
> happens, the same behavior will occur as with ptrace. A setuid execv
> will lose all its filtering.

I like the idea of letting the tracer decide what it wants.

Steven Rostedt

unread,

Jan 12, 2012, 12:50:01 PM1/12/12

to

On Thu, 2012-01-12 at 17:30 +0000, Jamie Lokier wrote:

> You can do this now, using ptrace(). It's horrible, but half of the
> horribleness is needing to understand machine-dependent registers,
> which this new patch doesn't address. (The other half is a ton of
> undocumented but important ptrace() behaviours on Linux.)

Yeah I know the horrid use of ptrace, I've implemented programs that use
it :-p

I guess ptrace can capture the execv and determine if it is OK or not to
run it. But again, this doesn't stop the possible attacks that could
happen, with having the execv on a symlink file, having the ptrace check
say its OK, and then switching the symlink to a setuid file.

When the new execv executed, the parent process would lose all control
over it. The idea is to prevent this.

I like Alan's suggestion. Have userspace decide to allow execv or not,
and even let it decide if it should allow setuid execv's or not, but
still allow non-setuid execvs. If you allow the setuid execv, once that
happens, the same behavior will occur as with ptrace. A setuid execv
will lose all its filtering.

-- Steve

Jamie Lokier

unread,

Jan 12, 2012, 1:00:02 PM1/12/12

to

What races do you know about?

I'm not aware of any ptrace races if it's used properly. I'm also not
sure what you mean by fail open/close here, unless you mean the target
process gets to carry on if the tracing process dies.

Having said that, I can think of one race, but I think your BPF scheme
has the same one: After checking the syscall's string arguments and
other pointed to data, another thread can change those arguments
before the real syscall uses them.

> With seccomp, it fails close. What I think would make sense would be
> to add a user-controllable failure mode with seccomp bpf that calls
> tracehook_ptrace_syscall_entry(regs). I've prototyped this and it
> works quite well, but I didn't want to conflate the discussions.

It think it's a nice idea. While you're at it could you fix all the
architectures to actually use tracehooks for syscall tracing ;-)

(I think it's ok to call the tracehook function on all archs though.)

> Using ptrace() would also mean that all consumers of this interface
> would need a supervisor, but with seccomp, the filters are installed
> and require no supervisors to stick around for when failure occurs.
>
> Does that make sense?

It does, I agree that ptrace() is quite cumbersome and you don't
always want a separate tracing process, especially if "failure" means
to die or get an error.

On the other hand, sometimes when a failure occurs, having another
process decide what to do, or log the event, is exactly what you want.

For my nefarious purposes I'm really just looking for a faster way to
reliably trace some activities of individual processes, in particular
tracking which files they access. I'd rather not interfere with
debuggers, so I'd really like your ability to stack multiple filters
to work with separate-process tracing as well. And I'd happily use a
filter rule which can dump some information over a pipe, without
waiting for the tracer to respond in most cases.

-- Jamie

Will Drewry

unread,

Jan 12, 2012, 1:00:03 PM1/12/12

to

On Thu, Jan 12, 2012 at 11:11 AM, Alan Cox <al...@lxorguk.ukuu.org.uk> wrote:
>> more about). Since setuid is privilege escalation, then perhaps it
>> makes sense to allow it as an escape hatch.
>>
>> Would it be sane to just disallow setuid exec exclusively?
>
> I think that is a policy question. I can imagine cases where either
> behaviour is the "right" one so it may need to be a parameter ?

Makes sense. I'll make it flaggable (ignoring the parallel conversation
about having a thread-wide suidable bit).

thanks!

Steven Rostedt

unread,

Jan 12, 2012, 1:00:03 PM1/12/12

to

On Thu, 2012-01-12 at 17:44 +0000, Jamie Lokier wrote:

> > I like Alan's suggestion. Have userspace decide to allow execv or not,
> > and even let it decide if it should allow setuid execv's or not, but
> > still allow non-setuid execvs. If you allow the setuid execv, once that
> > happens, the same behavior will occur as with ptrace. A setuid execv
> > will lose all its filtering.
>
> I like the idea of letting the tracer decide what it wants.

Right, and if we implement the suggestion that Linus made, to set a flag
to prevent a task from every getting privilege, then seccomp can add
that too.

That is, there can be a filter to say "prevent this task from doing
anything with privilege" and that will prevent execv from gaining setuid
privilege. Perhaps, it would still do the execv, but the program that is
executed will run as the normal user, and just fail when it tries to do
something that requires sys admin privilege.

Thus, execv will not be a "special" case here. Seccomp either allows it
or not. But also add a command to tell seccomp that this task will not
be allowed to do anything privileged.

-- Steve

Will Drewry

unread,

Jan 12, 2012, 1:00:04 PM1/12/12

to

On Thu, Jan 12, 2012 at 11:23 AM, Oleg Nesterov <ol...@redhat.com> wrote:
> On 01/12, Will Drewry wrote:
>>
>> On Thu, Jan 12, 2012 at 10:22 AM, Oleg Nesterov <ol...@redhat.com> wrote:
>> >> + */
>> >> + regs = seccomp_get_regs(regs_tmp, &regs_size);
>> >
>> > Stupid question. I am sure you know what are you doing ;) and I know
>> > nothing about !x86 arches.
>> >
>> > But could you explain why it is designed to use user_regs_struct ?
>> > Why we can't simply use task_pt_regs() and avoid the (costly) regsets?
>>
>> So on x86 32, it would work since user_regs_struct == task_pt_regs
>> (iirc), but on x86-64
>> and others, that's not true.
>
> Yes sure, I meant that userpace should use pt_regs too.
>
>> If it would be appropriate to expose pt_regs to userspace, then I'd
>> happily do so :)
>
> Ah, so that was the reason. But it is already exported? At least I see
> the "#ifndef __KERNEL__" definition in arch/x86/include/asm/ptrace.h.
>
> Once again, I am not arguing, just trying to understand. And I do not
> know if this definition is part of abi.

I don't either :/ My original idea was to operate on task_pt_regs(current),
but I noticed that PTRACE_GETREGS/SETREGS only uses the
user_regs_struct. So I went that route.

I'd love for pt_regs to be fair game to cut down on the copying!
will

Will Drewry

unread,

Jan 12, 2012, 1:10:01 PM1/12/12

to

I'm pretty sure that if you have two "isolated" processes, they could
cause irregular behavior using signals.

> I'm not aware of any ptrace races if it's used properly. I'm also not
> sure what you mean by fail open/close here, unless you mean the target
> process gets to carry on if the tracing process dies.

Exactly. Security systems that, on failure, allow the action to
proceed can't be relied on.

> Having said that, I can think of one race, but I think your BPF scheme
> has the same one: After checking the syscall's string arguments and
> other pointed to data, another thread can change those arguments
> before the real syscall uses them.

Not a problem - BPF only allows register inspection. No TOCTOU attacks
need apply :D

>> With seccomp, it fails close. What I think would make sense would be
>> to add a user-controllable failure mode with seccomp bpf that calls
>> tracehook_ptrace_syscall_entry(regs). I've prototyped this and it
>> works quite well, but I didn't want to conflate the discussions.
>
> It think it's a nice idea. While you're at it could you fix all the
> architectures to actually use tracehooks for syscall tracing ;-)
>
> (I think it's ok to call the tracehook function on all archs though.)
>
>> Using ptrace() would also mean that all consumers of this interface
>> would need a supervisor, but with seccomp, the filters are installed
>> and require no supervisors to stick around for when failure occurs.
>>
>> Does that make sense?
>
> It does, I agree that ptrace() is quite cumbersome and you don't
> always want a separate tracing process, especially if "failure" means
> to die or get an error.
>
> On the other hand, sometimes when a failure occurs, having another
> process decide what to do, or log the event, is exactly what you want.
>
> For my nefarious purposes I'm really just looking for a faster way to
> reliably trace some activities of individual processes, in particular
> tracking which files they access. I'd rather not interfere with
> debuggers, so I'd really like your ability to stack multiple filters
> to work with separate-process tracing as well. And I'd happily use a
> filter rule which can dump some information over a pipe, without
> waiting for the tracer to respond in most cases.

Cool - if the rest of this discussion proceeds, then hopefully, we can
move towards discussing if tying it with ptrace is a good idea or a
horrible one :)

thanks!

Jamie Lokier

unread,

Jan 12, 2012, 1:10:01 PM1/12/12

to

-- Jamie

Andrew Lutomirski

unread,

Jan 12, 2012, 1:20:02 PM1/12/12

to

On Thu, Jan 12, 2012 at 9:09 AM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
> On Thu, Jan 12, 2012 at 8:27 AM, Steven Rostedt <ros...@goodmis.org> wrote:
>>
>> In that case, just have execv fail if filtering is enabled and we are
>> execing a setuid program. But I don't see why non "magical" execv's
>> should be prohibited.
>
> The whole "fail security escalations" thing goes way beyond just
> filtering, I think we could seriously try to make it a generic
> feature.
>
> For example, somebody just asked me the other day why "chroot()"
> requires admin privileges, since it would be good to limit even
> non-root things.
>
> And it's really the exact same issue as filtering: in some sense,
> chroot() "filters" FS name lookups, and can be used to fool programs
> that are written to be secure.
>
> We could easily introduce a per-process flag that just says "cannot
> escalate privileges". Which basically just disables execve() of
> suid/sgid programs (and possibly other things too), and locks the
> process to the current privileges. And then make the rule be that *if*
> that flag is set, you can then filter across an execve, or chroot as a
> normal user, or whatever.

Like this?

http://lkml.indiana.edu/hypermail/linux/kernel/1003.3/01225.html

(This depends on execve_nosecurity, which is controversial, but that
dependency would be trivial to remove.)

Note that there's a huge can of worms if execve is allowed but
suid/sgid is not: selinux may elevate privileges on exec of pretty
much anything. (I think that this is a really awful idea, but it's in
the kernel, so we're stuck with it.)

--Andy

Linus Torvalds

unread,

Jan 12, 2012, 1:40:01 PM1/12/12

to

On Thu, Jan 12, 2012 at 10:18 AM, Andrew Lutomirski <lu...@mit.edu> wrote:
>
> Like this?
>
> http://lkml.indiana.edu/hypermail/linux/kernel/1003.3/01225.html

I don't know the execve_nosecurity patches, so the diff makes little
sense to me, but yeah, I wouldn't expect it to be more than a couple
of lines. Exactly *how* you set the bit etc is not something I care
deeply about, prctl seems about as good as anything.

> Note that there's a huge can of worms if execve is allowed but
> suid/sgid is not: selinux may elevate privileges on exec of pretty
> much anything. (I think that this is a really awful idea, but it's in
> the kernel, so we're stuck with it.)

You can do any amount of crazy things with selinux, but the other side
of the coin is that it would also be trivial to teach selinux about
this same "restricted environment" bit, and just say that a process
with that bit set doesn't get to match whatever selinux privilege
escalation rules..

I really don't think this is just about "execve cannot do setuid". I
think it's about the process being marked as restricted.

So in your patch, I think that "PR_RESTRICT_EXEC" bit is wrong. It
should simply be "PR_RESTRICT_ME", and be done with it, and not try to
artificially limit it to be some "execve feature", and more think of
it as a "this is a process that has *no* extra privileges at all, and
can never get them".

Linus

Andrew Lutomirski

unread,

Jan 12, 2012, 1:50:02 PM1/12/12

to

Fair enough. I'll submit the simpler patch tonight.

execve_nosecurity was my attempt to sidestep selinux issues. It's a
different syscall that does all of the non-security-related things
that execve does but does not escalate (or even change) any
privileges. Maybe I'll try to rework that for newer kernels as well.
The idea is that programs that expect to run in sandboxes / chroots /
namespaces / whatever can use it, and older programs that might
malfunction dangerously if the semantics of execve change will just
fail instead.

--Andy

Kyle Moffett

unread,

Jan 12, 2012, 2:10:02 PM1/12/12

to

> execve_nosecurity was my attempt to sidestep selinux issues. It's a
> different syscall that does all of the non-security-related things
> that execve does but does not escalate (or even change) any
> privileges. Maybe I'll try to rework that for newer kernels as well.
> The idea is that programs that expect to run in sandboxes / chroots /
> namespaces / whatever can use it, and older programs that might
> malfunction dangerously if the semantics of execve change will just
> fail instead.

I don't see any issues with SELinux support for this feature.

Specifically, when you try to execute something in SELinux, it will
first look at the types and try to "execute" (involving a type
transition IE: security label change).

But if that fails in many cases it may still be allowed to
"execute_no_trans" (IE: regular non-privileged exec() without a
transition).

If you add this feature, it should just disable the normal "execute"
with transition path and unconditionally fall back to
"execute_no_trans".

Likewise, enabling these bits should also disable the "transition" and
"dyntransition" process access vectors, and I'm on the fence about
whether "setfscreate", etc should be allowed.

Cheers,
Kyle Moffett

--
Curious about my work on the Debian powerpcspe port?
I'm keeping a blog here: http://pureperl.blogspot.com/

Andrew Lutomirski

unread,

Jan 12, 2012, 2:50:01 PM1/12/12

to

On Thu, Jan 12, 2012 at 11:40 AM, Will Drewry <w...@chromium.org> wrote:

> This sounds cool. Do you think you'll go for a new task_struct member
> or will it a securebit? (Seems like securebits might be too tied to
> posix file caps, but I figured I'd ask).
>
> I'm planning on going ahead and mocking up your potential patch so I
> can respin this series using it and make sure I understand the
> interactions.

I think securebits and cred didn't exist the first time I did this,
and sticking it in struct cred might unnecessarily prevent sharing
cred (assuming that even happens). So I'd say task_struct.

Will Drewry

unread,

Jan 12, 2012, 2:50:02 PM1/12/12

to

This sounds cool. Do you think you'll go for a new task_struct member
or will it a securebit? (Seems like securebits might be too tied to
posix file caps, but I figured I'd ask).

I'm planning on going ahead and mocking up your potential patch so I
can respin this series using it and make sure I understand the
interactions.

thanks!
will

Will Drewry

unread,

Jan 12, 2012, 2:50:02 PM1/12/12

to

Or cred member, etc.

Linus Torvalds

unread,

Jan 12, 2012, 3:10:02 PM1/12/12

to

On Thu, Jan 12, 2012 at 11:46 AM, Andrew Lutomirski <lu...@mit.edu> wrote:
>
> I think securebits and cred didn't exist the first time I did this,
> and sticking it in struct cred might unnecessarily prevent sharing
> cred (assuming that even happens). So I'd say task_struct.

I think it almost has to be task state, since we very much want to
make sure it's trivial to see that nothing ever clears that bit, and
that it always gets copied right over a fork/exec/whatever.

Putting it in some cred or capability bit or somethin would make that
kind of transparency pretty much totally impossible.

Linus

Will Drewry

unread,

Jan 12, 2012, 5:20:02 PM1/12/12

to

On Thu, Jan 12, 2012 at 11:40 AM, Steven Rostedt <ros...@goodmis.org> wrote:
> On Thu, 2012-01-12 at 17:30 +0000, Jamie Lokier wrote:
>
>> You can do this now, using ptrace(). It's horrible, but half of the
>> horribleness is needing to understand machine-dependent registers,
>> which this new patch doesn't address. (The other half is a ton of
>> undocumented but important ptrace() behaviours on Linux.)
>
> Yeah I know the horrid use of ptrace, I've implemented programs that use
> it :-p
>
> I guess ptrace can capture the execv and determine if it is OK or not to
> run it. But again, this doesn't stop the possible attacks that could
> happen, with having the execv on a symlink file, having the ptrace check
> say its OK, and then switching the symlink to a setuid file.
>
> When the new execv executed, the parent process would lose all control
> over it. The idea is to prevent this.
>
> I like Alan's suggestion. Have userspace decide to allow execv or not,
> and even let it decide if it should allow setuid execv's or not, but
> still allow non-setuid execvs. If you allow the setuid execv, once that
> happens, the same behavior will occur as with ptrace. A setuid execv
> will lose all its filtering.

In the ptrace case, doesn't it just downgrade the privileges of the new process
if there is a tracer, rather than detach the tracer?

Ignoring that, I've been looking at system call filters as being equivalent to
something like the caps bounding set. Once reduced, there's no going
back. I think Linus's proposal perfectly resolves the policy decision around
suid execution behavior in the run-with-privs or not scenarios (just like with
how ptrace does it). However, I'd like to avoid allowing any process to
escape system call filters once installed. (It's doable to add
suid/caps-based-bypass, but it certainly not ideal from my perspective.)

cheers,
will

Andrew Lutomirski

unread,

Jan 12, 2012, 6:10:02 PM1/12/12

to

I agree.

In principle, it could be safe for an outside (non-seccomp) process
with appropriate credentials to lift seccomp restrictions from a
different process. But why?

--Andy

Eric Paris

unread,

Jan 12, 2012, 6:10:02 PM1/12/12

to

On Thu, 2012-01-12 at 14:08 -0500, Kyle Moffett wrote:
> On Thu, Jan 12, 2012 at 13:44, Andrew Lutomirski <lu...@mit.edu> wrote:
> > On Thu, Jan 12, 2012 at 10:32 AM, Linus Torvalds <torv...@linux-foundation.org> wrote:

> >> You can do any amount of crazy things with selinux, but the other side
> >> of the coin is that it would also be trivial to teach selinux about
> >> this same "restricted environment" bit, and just say that a process
> >> with that bit set doesn't get to match whatever selinux privilege
> >> escalation rules..

> I don't see any issues with SELinux support for this feature.
>
> Specifically, when you try to execute something in SELinux, it will
> first look at the types and try to "execute" (involving a type
> transition IE: security label change).
>
> But if that fails in many cases it may still be allowed to
> "execute_no_trans" (IE: regular non-privileged exec() without a
> transition).

That's not true. See specifically
security/selinux/hooks.c::selinux_bprm_set_creds() We calculate a label
for the new task (that may or may not be the same) and then check if
there is permission to run the new binary with the new label. There is
no fallback.

The exception would be if the binary is on a MNT_NOSUID mount point, in
which case we calculate the new label, then just revert to the same
label.

At first glance it looks to me like a reasonable way to implement this
at first would be to do the new checks right next to any place we
already do MNT_NOSUID checks and mimic their behavior. If there are
other priv escalation points in the kernel we might need to consider if
MNT_NOSUID is adequate....

-Eric

Will Drewry

unread,

Jan 12, 2012, 6:40:02 PM1/12/12

to

Documents how system call filtering using Berkeley Packet
Filter programs works and how it may be used.
Includes an example for x86 (32-bit).

v3: - call out BPF <-> Berkeley Packet Filter (rdu...@xenotime.net)
- document use of tentative always-unprivileged
- guard sample compilation for i386 and x86_64
v2: - move code to samples (cor...@lwn.net)

Signed-off-by: Will Drewry <w...@chromium.org>
---
Documentation/prctl/seccomp_filter.txt | 94 ++++++++++++++++++++++++++++++++
samples/Makefile | 2 +-
samples/seccomp/Makefile | 18 ++++++
samples/seccomp/bpf-example.c | 74 +++++++++++++++++++++++++
4 files changed, 187 insertions(+), 1 deletions(-)
create mode 100644 Documentation/prctl/seccomp_filter.txt
create mode 100644 samples/seccomp/Makefile
create mode 100644 samples/seccomp/bpf-example.c

diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
new file mode 100644
index 0000000..2db8b89
--- /dev/null
+++ b/Documentation/prctl/seccomp_filter.txt
@@ -0,0 +1,94 @@
+ Seccomp filtering
+ =================
+
+Introduction
+------------
+
+A large number of system calls are exposed to every userland process
+with many of them going unused for the entire lifetime of the process.
+As system calls change and mature, bugs are found and eradicated. A
+certain subset of userland applications benefit by having a reduced set
+of available system calls. The resulting set reduces the total kernel
+surface exposed to the application. System call filtering is meant for
+use with those applications.
+
+Seccomp filtering provides a means for a process to specify a filter
+for incoming system calls. The filter is expressed as a Berkeley Packet
+Filter (BPF) program, as with socket filters, except that the data
+operated on is the current user_regs_struct. This allows for expressive
+filtering of system calls using the pre-existing system call ABI and
+using a filter program language with a long history of being exposed to
+userland. Additionally, BPF makes it impossible for users of seccomp to
+fall prey to time-of-check-time-of-use (TOCTOU) attacks that are common
+in system call interposition frameworks because the evaluated data is
+solely register state just after system call entry.
+
+What it isn't
+-------------
+
+System call filtering isn't a sandbox. It provides a clearly defined
+mechanism for minimizing the exposed kernel surface. Beyond that,
+policy for logical behavior and information flow should be managed with
+a combinations of other system hardening techniques and, potentially, a
+LSM of your choosing. Expressive, dynamic filters provide further options down
+this path (avoiding pathological sizes or selecting which of the multiplexed
+system calls in socketcall() is allowed, for instance) which could be
+construed, incorrectly, as a more complete sandboxing solution.
+
+Usage
+-----
+
+An additional seccomp mode is added, but they are not directly set by the
+consuming process. The new mode, '2', is only available if
+CONFIG_SECCOMP_FILTER is set and enabled using prctl with the
+PR_ATTACH_SECCOMP_FILTER argument.
+
+Interacting with seccomp filters is done using one prctl(2) call.
+
+PR_ATTACH_SECCOMP_FILTER:
+ Allows the specification of a new filter using a BPF program.
+ The BPF program will be executed over a user_regs_struct data
+ reflecting system call time except with the system call number
+ resident in orig_[register]. To allow a system call, the size
+ of the data must be returned. At present, all other return values
+ result in the system call being blocked, but it is recommended to
+ return 0 in those cases. This will allow for future custom return
+ values to be introduced, if ever desired.
+
+ Usage:
+ prctl(PR_ATTACH_SECCOMP_FILTER, prog);
+
+ The 'prog' argument is a pointer to a struct sock_fprog which will
+ contain the filter program. If the program is invalid, the call
+ will return -1 and set errno to -EINVAL.
+
+ The struct user_regs_struct the @prog will see is based on the
+ personality of the task at the time of this prctl call. Additionally,
+ is_compat_task is also tracked for the @prog. This means that once set
+ the calling task will have all of its system calls blocked if it
+ switches its system call ABI (via personality or other means).
+
+ If fork/clone and execve are allowed by @prog, any child processes will
+ be constrained to the same filters and syscal call ABI as the parent.
+
+ When called from an unprivileged process (lacking CAP_SYS_ADMIN), the
+ "always_unprivileged" bit is enabled for the process.
+
+ Additionally, if prctl(2) is allowed by the attached filter,
+ additional filters may be layered on which will increase evaluation
+ time, but allow for further decreasing the attack surface during
+ execution of a process.
+
+The above call returns 0 on success and non-zero on error.
+
+Example
+-------
+
+samples/seccomp-bpf-example.c shows an example process that allows read from stdin,
+write to stdout/err, exit and signal returns for 32-bit x86.
+
+Adding architecture support
+-----------------------
+
+Any platform with seccomp support will support seccomp filters
+as long as CONFIG_SECCOMP_FILTER is enabled.
diff --git a/samples/Makefile b/samples/Makefile
index 6280817..f29b19c 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -1,4 +1,4 @@
# Makefile for Linux samples code

obj-$(CONFIG_SAMPLES) += kobject/ kprobes/ tracepoints/ trace_events/ \
- hw_breakpoint/ kfifo/ kdb/ hidraw/
+ hw_breakpoint/ kfifo/ kdb/ hidraw/ seccomp/
diff --git a/samples/seccomp/Makefile b/samples/seccomp/Makefile
new file mode 100644
index 0000000..cdf0282
--- /dev/null
+++ b/samples/seccomp/Makefile
@@ -0,0 +1,18 @@
+# This sample is x86-only.
+ifeq ($(filter-out x86_64 i386,$(KBUILD_BUILDHOST)),)
+# kbuild trick to avoid linker error. Can be omitted if a module is built.
+obj- := dummy.o
+
+# List of programs to build
+hostprogs-y := bpf-example
+bpf-example-objs := bpf-example.o
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
+
+HOSTCFLAGS_bpf-example.o += -I$(objtree)/usr/include
+ifeq ($(KBUILD_BUILDHOST),x86_64)
+HOSTCFLAGS_bpf-example.o += -m32
+HOSTLOADLIBES_bpf-example += -m32
+endif
+endif # host arch is x86
diff --git a/samples/seccomp/bpf-example.c b/samples/seccomp/bpf-example.c
new file mode 100644
index 0000000..f98b70a
--- /dev/null
+++ b/samples/seccomp/bpf-example.c
@@ -0,0 +1,74 @@
+/*
+ * Seccomp BPF example
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <chromiu...@chromium.org>
+ * Author: Will Drewry <w...@chromium.org>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
+ */
+
+#include <asm/unistd.h>
+#include <linux/filter.h>
+#include <stdio.h>
+#include <stddef.h>
+#include <sys/prctl.h>
+#include <sys/user.h>
+#include <unistd.h>
+
+#ifndef PR_ATTACH_SECCOMP_FILTER
+# define PR_ATTACH_SECCOMP_FILTER 36
+#endif
+
+#define regoffset(_reg) (offsetof(struct user_regs_struct, _reg))
+static int install_filter(void)
+{
+ struct sock_filter filter[] = {
+ /* Grab the system call number */
+ BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(orig_eax)),
+ /* Jump table for the allowed syscalls */
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_rt_sigreturn, 10, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_sigreturn, 9, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit_group, 8, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit, 7, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_read, 1, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_write, 2, 6),
+
+ /* Check that read is only using stdin. */
+ BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(ebx)),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDIN_FILENO, 3, 4),
+
+ /* Check that write is only using stdout/stderr */
+ BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(ebx)),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDOUT_FILENO, 1, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDERR_FILENO, 0, 1),
+
+ /* Put the "accept" value in A */
+ BPF_STMT(BPF_LD+BPF_W+BPF_LEN, 0),
+
+ BPF_STMT(BPF_RET+BPF_A,0),
+ };
+ struct sock_fprog prog = {
+ .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+ .filter = filter,
+ };
+ if (prctl(PR_ATTACH_SECCOMP_FILTER, &prog)) {
+ perror("prctl");
+ return 1;
+ }
+ return 0;
+}
+
+#define payload(_c) _c, sizeof(_c)
+int main(int argc, char **argv) {
+ char buf[4096];
+ ssize_t bytes = 0;
+ if (install_filter())
+ return 1;
+ syscall(__NR_write, STDOUT_FILENO, payload("OHAI! WHAT IS YOUR NAME? "));
+ bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf));
+ syscall(__NR_write, STDOUT_FILENO, payload("HELLO, "));
+ syscall(__NR_write, STDOUT_FILENO, buf, bytes);
+ return 0;
+}
--
1.7.5.4

Linus Torvalds

unread,

Jan 12, 2012, 6:40:02 PM1/12/12

to

On Thu, Jan 12, 2012 at 3:27 PM, Alan Cox <al...@lxorguk.ukuu.org.uk> wrote:
>
> It's a change of user context. Things like ptrace and file permissions
> basically mean you can't build a barrier between stuff running as the
> same uid to a great extent except with heavy restricting, but saying
> "you can't become someone else" is very useful.

Not just "someone else".

The guarantee basically has to be "you can't change your security
context". Where "become somebody else" is part of it, but any
capability changes etc would be part of it too. So it should disable
all games with capabilities etc.

And I don't think selinux really should be all that much of a problem
- we should just make sure that selinux would honor such a bit, and
refuse to do any op that would change any selinux capabilities either.
Same goes for other security models.

And that may include restricting the ways a binary can be executed
totally outside of suid/sgid bits. For example, if you consider
binaries under /home to have different selinxu rules than system
binaries in /usr/bin, then a cross-execute from one to the other may
not work, regardless of whether it's suid or not.

I think that is the kind of guarantee a sandbox environment really
wants: "I'm setting up a sandbox, you'd better not change the
permissions on me regardless of what crazy things I do".

Linus

Andrew Lutomirski

unread,

Jan 12, 2012, 6:40:02 PM1/12/12

to

I don't really like the current logic. It does:

if (old_tsec->exec_sid) {
new_tsec->sid = old_tsec->exec_sid;
/* Reset exec SID on execve. */
new_tsec->exec_sid = 0;
} else {
/* Check for a default transition on this program. */
rc = security_transition_sid(old_tsec->sid, isec->sid,
SECCLASS_PROCESS, NULL,
&new_tsec->sid);
if (rc)
return rc;
}

COMMON_AUDIT_DATA_INIT(&ad, PATH);
ad.u.path = bprm->file->f_path;

if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
new_tsec->sid = old_tsec->sid;

which means that, if MNT_NOSUD, then exec_sid is silently ignored.
I'd rather fail in that case, but it's probably too late for that.
However, if we set the "no new privileges" flag, then we could fail,
since there's no old ABI to be compatible with. I'll implement it
that way.

--Andy

Alan Cox

unread,

Jan 12, 2012, 6:40:02 PM1/12/12

to

> Thus, execv will not be a "special" case here. Seccomp either allows it
> or not. But also add a command to tell seccomp that this task will not
> be allowed to do anything privileged.

A setuid binary is not necessarily priviledged - indeed a root -> user
transition via setuid is pretty much the reverse.

It's a change of user context. Things like ptrace and file permissions
basically mean you can't build a barrier between stuff running as the
same uid to a great extent except with heavy restricting, but saying
"you can't become someone else" is very useful.

Alan

Will Drewry

unread,

Jan 12, 2012, 6:40:02 PM1/12/12

to

This patch adds support for seccomp mode 2. This mode enables dynamic
enforcement of system call filtering policy in the kernel as specified
by a userland task. The policy is expressed in terms of a BPF program,
as is used for userland-exposed socket filtering. Instead of network
data, the BPF program is evaluated over struct user_regs_struct at the
time of the system call (as retrieved using regviews).

A filter program may be installed by a userland task by calling
prctl(PR_ATTACH_SECCOMP_FILTER, &fprog);
where fprog is of type struct sock_fprog.

If the first filter program allows subsequent prctl(2) calls, then
additional filter programs may be attached. All attached programs
must be evaluated before a system call will be allowed to proceed.

To avoid CONFIG_COMPAT related landmines, once a filter program is
installed using specific is_compat_task() and current->personality, it
is not allowed to make system calls or attach additional filters which
use a different combination of is_compat_task() and
current->personality.

Filter programs are inherited across fork, clone, and execve, but if a
filter is attached by an unprivileged process (!CAP_SYS_ADMIN), the
"always unprivileged" process bit will be set. This ensures that
unprivileged processes may not enact unexpected system call filters on
processes that are run as different users or with different capabilities
unless the process already had that ability.

There are a number of benefits to this approach. A few of which are
as follows:
- BPF has been exposed to userland for a long time.
- Userland already knows its ABI: expected register layout and system
call numbers.
- Full register information is provided which may be relevant for
certain syscalls (fork, rt_sigreturn) or for other userland
filtering tactics (checking the PC).
- No time-of-check-time-of-use vulnerable data accesses are possible.

This patch includes its own BPF evaluator, but relies on the
net/core/filter.c BPF checking code. It is possible to share
evaluators, but the performance sensitive nature of the network
filtering path makes it an iterative optimization which (I think :) can
be tackled separately via separate patchsets. (And at some point sharing
BPF JIT code!)

v3: - macros to inline (ol...@redhat.com)
- init_task behavior fixed (ol...@redhat.com)
- drop creator entry and extra NULL check (ol...@redhat.com)
- alloc returns -EINVAL on bad sizing (serge....@canonical.com)
- (!) adds tentative use of "always_unprivileged" as per
torv...@linux-foundation.org and lu...@mit.edu
v2: - n/a

Signed-off-by: Will Drewry <w...@chromium.org>
---

include/linux/prctl.h | 3 +
include/linux/seccomp.h | 68 +++++-
kernel/Makefile | 1 +
kernel/fork.c | 4 +
kernel/seccomp.c | 8 +
kernel/seccomp_filter.c | 620 +++++++++++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 4 +
security/Kconfig | 12 +
8 files changed, 717 insertions(+), 3 deletions(-)
create mode 100644 kernel/seccomp_filter.c

diff --git a/include/linux/prctl.h b/include/linux/prctl.h
index d5d6ab6..a53fae3 100644
--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -64,6 +64,9 @@
#define PR_GET_SECCOMP 21
#define PR_SET_SECCOMP 22

+/* Set process seccomp filters */
+#define PR_ATTACH_SECCOMP_FILTER 36
+
/* Get/set the capability bounding set (as per security/commoncap.c) */
#define PR_CAPBSET_READ 23
#define PR_CAPBSET_DROP 24
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index cc7a4e9..0296871 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -5,9 +5,28 @@
#ifdef CONFIG_SECCOMP

#include <linux/thread_info.h>
+#include <linux/types.h>
#include <asm/seccomp.h>

-typedef struct { int mode; } seccomp_t;
+struct seccomp_filter;
+/**
+ * struct seccomp_struct - the state of a seccomp'ed process
+ *
+ * @mode:
+ * if this is 0, seccomp is not in use.
+ * is 1, the process is under standard seccomp rules.
+ * is 2, the process is only allowed to make system calls where
+ * associated filters evaluate successfully.
+ * @filter: Metadata for filter if using CONFIG_SECCOMP_FILTER.
+ * @filter must only be accessed from the context of current as there
+ * is no guard.
+ */
+typedef struct seccomp_struct {
+ int mode;
+#ifdef CONFIG_SECCOMP_FILTER
+ struct seccomp_filter *filter;
+#endif
+} seccomp_t;

extern void __secure_computing(int);
static inline void secure_computing(int this_syscall)
@@ -28,8 +47,7 @@ static inline int seccomp_mode(seccomp_t *s)

#include <linux/errno.h>

-typedef struct { } seccomp_t;
-
+typedef struct seccomp_struct { } seccomp_t;
#define secure_computing(x) do { } while (0)

static inline long prctl_get_seccomp(void)
@@ -49,4 +67,48 @@ static inline int seccomp_mode(seccomp_t *s)

#endif /* CONFIG_SECCOMP */

+#ifdef CONFIG_SECCOMP_FILTER
+
+
+extern long prctl_attach_seccomp_filter(char __user *);
+
+extern struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *);
+extern void put_seccomp_filter(struct seccomp_filter *);
+
+extern int seccomp_test_filters(int);
+extern void seccomp_filter_log_failure(int);
+extern void seccomp_struct_fork(struct seccomp_struct *child,
+ const struct seccomp_struct *parent);
+
+static inline void seccomp_struct_init_task(struct seccomp_struct *seccomp)
+{
+ seccomp->mode = 0;
+ seccomp->filter = NULL;
+}
+
+/* No locking is needed here because the task_struct will
+ * have no parallel consumers.
+ */
+static inline void seccomp_struct_free_task(struct seccomp_struct *seccomp)
+{
+ put_seccomp_filter(seccomp->filter);
+ seccomp->filter = NULL;
+}
+
+#else /* CONFIG_SECCOMP_FILTER */
+
+#include <linux/errno.h>
+
+struct seccomp_filter { };
+/* Macros consume the unused dereference by the caller. */
+#define seccomp_struct_init_task(_seccomp) do { } while (0);
+#define seccomp_struct_fork(_tsk, _orig) do { } while (0);
+#define seccomp_struct_free_task(_seccomp) do { } while (0);
+
+static inline long prctl_attach_seccomp_filter(char __user *a2)
+{
+ return -ENOSYS;
+}
+
+#endif /* CONFIG_SECCOMP_FILTER */
#endif /* _LINUX_SECCOMP_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index e898c5b..0584090 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -79,6 +79,7 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
+obj-$(CONFIG_SECCOMP_FILTER) += seccomp_filter.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
obj-$(CONFIG_TREE_RCU) += rcutree.o
obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
diff --git a/kernel/fork.c b/kernel/fork.c
index da4a6a1..22f7ec1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -34,6 +34,7 @@
#include <linux/cgroup.h>
#include <linux/security.h>
#include <linux/hugetlb.h>
+#include <linux/seccomp.h>
#include <linux/swap.h>
#include <linux/syscalls.h>
#include <linux/jiffies.h>
@@ -166,6 +167,7 @@ void free_task(struct task_struct *tsk)
free_thread_info(tsk->stack);
rt_mutex_debug_task_free(tsk);
ftrace_graph_exit_task(tsk);
+ seccomp_struct_free_task(&tsk->seccomp);
free_task_struct(tsk);
}
EXPORT_SYMBOL(free_task);
@@ -1089,6 +1091,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
goto fork_out;

ftrace_graph_init_task(p);
+ seccomp_struct_init_task(&p->seccomp);

rt_mutex_init_task(p);

@@ -1375,6 +1378,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
if (clone_flags & CLONE_THREAD)
threadgroup_fork_read_unlock(current);
perf_event_fork(p);
+ seccomp_struct_fork(&p->seccomp, &current->seccomp);
return p;

bad_fork_free_pid:
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 57d4b13..78719be 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -47,6 +47,14 @@ void __secure_computing(int this_syscall)
return;
} while (*++syscall);
break;
+#ifdef CONFIG_SECCOMP_FILTER
+ case 2:
+ if (seccomp_test_filters(this_syscall) == 0)
+ return;
+
+ seccomp_filter_log_failure(this_syscall);
+ break;
+#endif
default:
BUG();
}
diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
new file mode 100644
index 0000000..108a3f3
--- /dev/null
+++ b/kernel/seccomp_filter.c
@@ -0,0 +1,620 @@
+/* bpf program-based system call filtering
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) 2011 The Chromium OS Authors <chromiu...@chromium.org>
+ */
+
+#include <linux/capability.h>
+#include <linux/compat.h>
+#include <linux/err.h>
+#include <linux/errno.h>
+#include <linux/rculist.h>
+#include <linux/filter.h>
+#include <linux/kallsyms.h>
+#include <linux/kref.h>
+#include <linux/module.h>
+#include <linux/pid.h>
+#include <linux/prctl.h>
+#include <linux/ptrace.h>
+#include <linux/ratelimit.h>
+#include <linux/reciprocal_div.h>
+#include <linux/regset.h>
+#include <linux/seccomp.h>
+#include <linux/security.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/user.h>
+
+
+/**
+ * struct seccomp_filter - container for seccomp BPF programs
+ *
+ * @usage: reference count to manage the object lifetime.
+ * get/put helpers should be used when accessing an instance
+ * outside of a lifetime-guarded section. In general, this
+ * is only needed for handling filters shared across tasks.
+ * @parent: pointer to the ancestor which this filter will be composed with.
+ * @flags: provide information about filter from creation time.
+ * @personality: personality of the process at filter creation time.
+ * @insns: the BPF program instructions to evaluate
+ * @count: the number of instructions in the program.
+ *
+ * seccomp_filter objects should never be modified after being attached
+ * to a task_struct (other than @usage).
+ */
+struct seccomp_filter {
+ struct kref usage;
+ struct seccomp_filter *parent;
+ struct {
+ uint32_t compat:1, /* CONFIG_COMPAT */
+ __reserved:31;
+ } flags;
+ int personality;
+ unsigned short count; /* Instruction count */
+ struct sock_filter insns[0];
+};
+
+static unsigned int seccomp_run_filter(const u8 *buf,
+ const size_t buflen,
+ const struct sock_filter *);
+
+/**
+ * seccomp_filter_alloc - allocates a new filter object
+ * @padding: size of the insns[0] array in bytes
+ *
+ * The @padding should be a multiple of
+ * sizeof(struct sock_filter).
+ *
+ * Returns ERR_PTR on error or an allocated object.
+ */
+static struct seccomp_filter *seccomp_filter_alloc(unsigned long padding)
+{
+ struct seccomp_filter *f;
+ unsigned long bpf_blocks = padding / sizeof(struct sock_filter);
+
+ /* Drop oversized requests. */
+ if (bpf_blocks == 0 || bpf_blocks > BPF_MAXINSNS)
+ return ERR_PTR(-EINVAL);
+
+ /* Padding should always be in sock_filter increments. */
+ if (padding % sizeof(struct sock_filter))
+ return ERR_PTR(-EINVAL);
+
+ f = kzalloc(sizeof(struct seccomp_filter) + padding, GFP_KERNEL);
+ if (!f)
+ return ERR_PTR(-ENOMEM);
+ kref_init(&f->usage);
+ f->count = bpf_blocks;
+ return f;
+}
+
+/**
+ * seccomp_filter_free - frees the allocated filter.
+ * @filter: NULL or live object to be completely destructed.
+ */
+static void seccomp_filter_free(struct seccomp_filter *filter)
+{
+ if (!filter)
+ return;
+ put_seccomp_filter(filter->parent);
+ kfree(filter);
+}
+
+static void __put_seccomp_filter(struct kref *kref)
+{
+ struct seccomp_filter *orig =
+ container_of(kref, struct seccomp_filter, usage);
+ seccomp_filter_free(orig);
+}
+
+void seccomp_filter_log_failure(int syscall)
+{
+ pr_info("%s[%d]: system call %d blocked at 0x%lx\n",
+ current->comm, task_pid_nr(current), syscall,
+ KSTK_EIP(current));
+}
+
+/* put_seccomp_filter - decrements the ref count of @orig and may free. */
+void put_seccomp_filter(struct seccomp_filter *orig)
+{
+ if (!orig)
+ return;
+ kref_put(&orig->usage, __put_seccomp_filter);
+}
+
+/* get_seccomp_filter - increments the reference count of @orig. */
+struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *orig)
+{
+ if (!orig)
+ return NULL;
+ kref_get(&orig->usage);
+ return orig;
+}
+
+static int seccomp_check_personality(struct seccomp_filter *filter)
+{
+ if (filter->personality != current->personality)
+ return -EACCES;
+#ifdef CONFIG_COMPAT
+ if (filter->flags.compat != (!!(is_compat_task())))
+ return -EACCES;
+#endif

+ return 0;
+}
+

+static const struct user_regset *
+find_prstatus(const struct user_regset_view *view)
+{
+ const struct user_regset *regset;
+ int n;
+
+ /* Skip 0. */
+ for (n = 1; n < view->n; ++n) {
+ regset = view->regsets + n;
+ if (regset->core_note_type == NT_PRSTATUS)
+ return regset;
+ }
+
+ return NULL;
+}
+
+/**
+ * seccomp_get_regs - returns a pointer to struct user_regs_struct
+ * @scratch: preallocated storage of size @available
+ * @available: pointer to the size of scratch.
+ *
+ * Returns NULL if the registers cannot be acquired or copied.
+ * Returns a populated pointer to @scratch by default.
+ * Otherwise, returns a pointer to a a u8 array containing the struct
+ * user_regs_struct appropriate for the task personality. The pointer
+ * may be to the beginning of @scratch or to an externally managed data
+ * structure. On success, @available should be updated with the
+ * valid region size of the returned pointer.
+ *
+ * If the architecture overrides the linkage, then the pointer may pointer to
+ * another location.
+ */
+__weak u8 *seccomp_get_regs(u8 *scratch, size_t *available)
+{
+ /* regset is usually returned based on task personality, not current
+ * system call convention. This behavior makes it unsafe to execute
+ * BPF programs over regviews if is_compat_task or the personality
+ * have changed since the program was installed.
+ */
+ const struct user_regset_view *view = task_user_regset_view(current);
+ const struct user_regset *regset = &view->regsets[0];
+ size_t scratch_size = *available;
+ if (regset->core_note_type != NT_PRSTATUS) {
+ /* The architecture should override this method for speed. */
+ regset = find_prstatus(view);
+ if (!regset)
+ return NULL;
+ }
+ *available = regset->n * regset->size;
+ /* Make sure the scratch space isn't exceeded. */
+ if (*available > scratch_size)
+ *available = scratch_size;
+ if (regset->get(current, regset, 0, *available, scratch, NULL))
+ return NULL;
+ return scratch;
+}
+
+/**
+ * seccomp_test_filters - tests 'current' against the given syscall
+ * @syscall: number of the system call to test
+ *
+ * Returns 0 on ok and non-zero on error/failure.
+ */
+int seccomp_test_filters(int syscall)
+{
+ struct seccomp_filter *filter;
+ u8 regs_tmp[sizeof(struct user_regs_struct)], *regs;
+ size_t regs_size = sizeof(struct user_regs_struct);
+ int ret = -EACCES;
+
+ filter = current->seccomp.filter; /* uses task ref */
+ if (!filter)
+ goto out;
+
+ /* All filters in the list are required to share the same system call
+ * convention so only the first filter is ever checked.
+ */
+ if (seccomp_check_personality(filter))
+ goto out;
+
+ /* Grab the user_regs_struct. Normally, regs == &regs_tmp, but
+ * that is not mandatory. E.g., it may return a point to
+ * task_pt_regs(current). NULL checking is mandatory.

+ */
+ regs = seccomp_get_regs(regs_tmp, &regs_size);

+ if (!regs)
+ goto out;
+
+ /* Only allow a system call if it is allowed in all ancestors. */
+ ret = 0;
+ for ( ; filter != NULL; filter = filter->parent) {
+ /* Allowed if return value is the size of the data supplied. */
+ if (seccomp_run_filter(regs, regs_size, filter->insns) !=
+ regs_size)
+ ret = -EACCES;
+ }
+out:
+ return ret;
+}
+
+/**
+ * seccomp_attach_filter: Attaches a seccomp filter to current.
+ * @fprog: BPF program to install
+ *
+ * Context: User context only. This function may sleep on allocation and
+ * operates on current. current must be attempting a system call
+ * when this is called (usually prctl).
+ *
+ * This function may be called repeatedly to install additional filters.
+ * Every filter successfully installed will be evaluated (in reverse order)
+ * for each system call the thread makes.
+ *
+ * Returns 0 on success or an errno on failure.
+ */
+long seccomp_attach_filter(struct sock_fprog *fprog)
+{
+ struct seccomp_filter *filter = NULL;
+ /* Note, len is a short so overflow should be impossible. */
+ unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
+ long ret = -EPERM;
+
+ /* Allocate a new seccomp_filter */
+ filter = seccomp_filter_alloc(fp_size);
+ if (IS_ERR(filter)) {
+ ret = PTR_ERR(filter);
+ goto out;
+ }
+
+ /* Lock the process personality and calling convention. */
+#ifdef CONFIG_COMPAT
+ if (is_compat_task())
+ filter->flags.compat = 1;
+#endif
+ filter->personality = current->personality;
+
+ /* If a process lacks CAP_SYS_ADMIN in its namespace, force
+ * this process and all descendents to always run unprivileged.
+ * A privileged process will need to set this bit independently,
+ * if desired.
+ */
+ if (security_real_capable_noaudit(current, current_user_ns(),
+ CAP_SYS_ADMIN) != 0)
+ current->always_unprivileged = 1;
+
+ /* Copy the instructions from fprog. */
+ ret = -EFAULT;
+ if (copy_from_user(filter->insns, fprog->filter, fp_size))
+ goto out;
+
+ /* Check the fprog */
+ ret = sk_chk_filter(filter->insns, filter->count);
+ if (ret)
+ goto out;
+
+ /* If there is an existing filter, make it the parent
+ * and reuse the existing task-based ref.
+ */
+ filter->parent = current->seccomp.filter;
+
+ /* Force all filters to use one system call convention. */
+ ret = -EINVAL;
+ if (filter->parent) {
+ if (filter->parent->flags.compat != filter->flags.compat)
+ goto out;
+ if (filter->parent->personality != filter->personality)
+ goto out;
+ }
+
+ /* Double claim the new filter so we can release it below simplifying
+ * the error paths earlier.
+ */
+ ret = 0;
+ get_seccomp_filter(filter);
+ current->seccomp.filter = filter;
+ /* Engage seccomp if it wasn't. This doesn't use PR_SET_SECCOMP. */
+ if (!current->seccomp.mode) {
+ current->seccomp.mode = 2;
+ set_thread_flag(TIF_SECCOMP);
+ }
+
+out:
+ put_seccomp_filter(filter); /* for get or task, on err */
+ return ret;
+}
+
+long prctl_attach_seccomp_filter(char __user *user_filter)
+{
+ struct sock_fprog fprog;
+ long ret = -EINVAL;
+
+ ret = -EFAULT;
+ if (!user_filter)
+ goto out;
+
+ if (copy_from_user(&fprog, user_filter, sizeof(fprog)))
+ goto out;
+
+ ret = seccomp_attach_filter(&fprog);
+out:
+ return ret;
+}
+
+/* seccomp_struct_fork: manages inheritance on fork
+ * @child: forkee's seccomp_struct
+ * @parent: forker's seccomp_struct
+ * Ensures that @child inherit a seccomp_filter iff seccomp is enabled
+ * and the set of filters is marked as 'enabled'.
+ */
+void seccomp_struct_fork(struct seccomp_struct *child,
+ const struct seccomp_struct *parent)
+{
+ if (!parent->mode)
+ return;
+ child->mode = parent->mode;
+ child->filter = get_seccomp_filter(parent->filter);
+}
+
+/* Returns a pointer to the BPF evaluator after checking the offset and size
+ * boundaries. The signature almost matches the signature from
+ * net/core/filter.c with the hopes of sharing code in the future.
+ */
+static const void *load_pointer(const u8 *buf, size_t buflen,
+ int offset, size_t size,
+ void *unused)
+{
+ if (offset >= buflen)
+ goto fail;
+ if (offset < 0)
+ goto fail;
+ if (size > buflen - offset)
+ goto fail;
+ return buf + offset;
+fail:
+ return NULL;
+}
+
+/**
+ * seccomp_run_filter - evaluate BPF (over user_regs_struct)
+ * @buf: buffer to execute the filter over
+ * @buflen: length of the buffer
+ * @fentry: filter to apply
+ *
+ * Decode and apply filter instructions to the buffer.
+ * Return length to keep, 0 for none. @buf is a regset we are
+ * filtering, @filter is the array of filter instructions.
+ * Because all jumps are guaranteed to be before last instruction,
+ * and last instruction guaranteed to be a RET, we dont need to check
+ * flen.
+ *
+ * See core/net/filter.c as this is nearly an exact copy.
+ * At some point, it would be nice to merge them to take advantage of
+ * optimizations (like JIT).
+ *
+ * A successful filter must return the full length of the data. Anything less
+ * will currently result in a seccomp failure. In the future, it may be
+ * possible to use that for hard filtering registers on the fly so it is
+ * ideal for consumers to return 0 on intended failure.
+ */
+static unsigned int seccomp_run_filter(const u8 *buf,
+ const size_t buflen,
+ const struct sock_filter *fentry)
+{
+ const void *ptr;
+ u32 A = 0; /* Accumulator */
+ u32 X = 0; /* Index Register */
+ u32 mem[BPF_MEMWORDS]; /* Scratch Memory Store */
+ u32 tmp;
+ int k;
+
+ /*
+ * Process array of filter instructions.
+ */
+ for (;; fentry++) {
+#if defined(CONFIG_X86_32)
+#define K (fentry->k)
+#else
+ const u32 K = fentry->k;
+#endif
+
+ switch (fentry->code) {
+ case BPF_S_ALU_ADD_X:
+ A += X;
+ continue;
+ case BPF_S_ALU_ADD_K:
+ A += K;
+ continue;
+ case BPF_S_ALU_SUB_X:
+ A -= X;
+ continue;
+ case BPF_S_ALU_SUB_K:
+ A -= K;
+ continue;
+ case BPF_S_ALU_MUL_X:
+ A *= X;
+ continue;
+ case BPF_S_ALU_MUL_K:
+ A *= K;
+ continue;
+ case BPF_S_ALU_DIV_X:
+ if (X == 0)
+ return 0;
+ A /= X;
+ continue;
+ case BPF_S_ALU_DIV_K:
+ A = reciprocal_divide(A, K);
+ continue;
+ case BPF_S_ALU_AND_X:
+ A &= X;
+ continue;
+ case BPF_S_ALU_AND_K:
+ A &= K;
+ continue;
+ case BPF_S_ALU_OR_X:
+ A |= X;
+ continue;
+ case BPF_S_ALU_OR_K:
+ A |= K;
+ continue;
+ case BPF_S_ALU_LSH_X:
+ A <<= X;
+ continue;
+ case BPF_S_ALU_LSH_K:
+ A <<= K;
+ continue;
+ case BPF_S_ALU_RSH_X:
+ A >>= X;
+ continue;
+ case BPF_S_ALU_RSH_K:
+ A >>= K;
+ continue;
+ case BPF_S_ALU_NEG:
+ A = -A;
+ continue;
+ case BPF_S_JMP_JA:
+ fentry += K;
+ continue;
+ case BPF_S_JMP_JGT_K:
+ fentry += (A > K) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JGE_K:
+ fentry += (A >= K) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JEQ_K:
+ fentry += (A == K) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JSET_K:
+ fentry += (A & K) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JGT_X:
+ fentry += (A > X) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JGE_X:
+ fentry += (A >= X) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JEQ_X:
+ fentry += (A == X) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JSET_X:
+ fentry += (A & X) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_LD_W_ABS:
+ k = K;
+load_w:
+ ptr = load_pointer(buf, buflen, k, 4, &tmp);
+ if (ptr != NULL) {
+ /* Note, unlike on network data, values are not
+ * byte swapped.
+ */
+ A = *(const u32 *)ptr;
+ continue;
+ }
+ return 0;
+ case BPF_S_LD_H_ABS:
+ k = K;
+load_h:
+ ptr = load_pointer(buf, buflen, k, 2, &tmp);
+ if (ptr != NULL) {
+ A = *(const u16 *)ptr;
+ continue;
+ }
+ return 0;
+ case BPF_S_LD_B_ABS:
+ k = K;
+load_b:
+ ptr = load_pointer(buf, buflen, k, 1, &tmp);
+ if (ptr != NULL) {
+ A = *(const u8 *)ptr;
+ continue;
+ }
+ return 0;
+ case BPF_S_LD_W_LEN:
+ A = buflen;
+ continue;
+ case BPF_S_LDX_W_LEN:
+ X = buflen;
+ continue;
+ case BPF_S_LD_W_IND:
+ k = X + K;
+ goto load_w;
+ case BPF_S_LD_H_IND:
+ k = X + K;
+ goto load_h;
+ case BPF_S_LD_B_IND:
+ k = X + K;
+ goto load_b;
+ case BPF_S_LDX_B_MSH:
+ ptr = load_pointer(buf, buflen, K, 1, &tmp);
+ if (ptr != NULL) {
+ X = (*(u8 *)ptr & 0xf) << 2;
+ continue;
+ }
+ return 0;
+ case BPF_S_LD_IMM:
+ A = K;
+ continue;
+ case BPF_S_LDX_IMM:
+ X = K;
+ continue;
+ case BPF_S_LD_MEM:
+ A = mem[K];
+ continue;
+ case BPF_S_LDX_MEM:
+ X = mem[K];
+ continue;
+ case BPF_S_MISC_TAX:
+ X = A;
+ continue;
+ case BPF_S_MISC_TXA:
+ A = X;
+ continue;
+ case BPF_S_RET_K:
+ return K;
+ case BPF_S_RET_A:
+ return A;
+ case BPF_S_ST:
+ mem[K] = A;
+ continue;
+ case BPF_S_STX:
+ mem[K] = X;
+ continue;
+ case BPF_S_ANC_PROTOCOL:
+ case BPF_S_ANC_PKTTYPE:
+ case BPF_S_ANC_IFINDEX:
+ case BPF_S_ANC_MARK:
+ case BPF_S_ANC_QUEUE:
+ case BPF_S_ANC_HATYPE:
+ case BPF_S_ANC_RXHASH:
+ case BPF_S_ANC_CPU:
+ case BPF_S_ANC_NLATTR:
+ case BPF_S_ANC_NLATTR_NEST:
+ /* ignored */
+ continue;
+ default:
+ WARN_RATELIMIT(1, "Unknown code:%u jt:%u tf:%u k:%u\n",
+ fentry->code, fentry->jt,
+ fentry->jf, fentry->k);

+ return 0;
+ }
+ }
+
+ return 0;
+}

diff --git a/kernel/sys.c b/kernel/sys.c
index fbb6248..18b7397 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1782,6 +1782,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_SET_SECCOMP:
error = prctl_set_seccomp(arg2);
break;
+ case PR_ATTACH_SECCOMP_FILTER:
+ error = prctl_attach_seccomp_filter((char __user *)
+ arg2);
+ break;
case PR_GET_TSC:
error = GET_TSC_CTL(arg2);
break;
diff --git a/security/Kconfig b/security/Kconfig
index 51bd5a0..77b1106 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -84,6 +84,18 @@ config SECURITY_DMESG_RESTRICT

If you are unsure how to answer this question, answer N.

+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ select SECCOMP
+ depends on EXPERIMENTAL
+ help
+ This kernel feature expands CONFIG_SECCOMP to allow computing
+ in environments with reduced kernel access dictated by a system
+ call filter, expressed in BPF, installed by the application itself
+ through prctl(2).
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config SECURITY
bool "Enable different security models"
depends on SYSFS
--
1.7.5.4

Will Drewry

unread,

Jan 12, 2012, 6:40:02 PM1/12/12

to

This patch is a placeholder until Andy's (lu...@mit.edu) patch arrives
implementing Linus's proposal for applying a "this is a process that has

*no* extra privileges at all, and can never get them".

It adds the "always_unprivileged" member to the task_struct and the
ability for a process to set it to 1 via prctl. Fixup is then done
alongside MNT_NOSUID in fs/exec.c and security/commoncap.c
(as Eric Paris suggested).

selinux/hooks.c have not been touched but need a similar one line
change. That said, this is just a placeholder and is not meant to
be authoritative: I look forward to Andy's patches!

Signed-off-by: Will Drewry <w...@chromium.org>
---

fs/exec.c | 3 ++-
include/linux/prctl.h | 6 ++++++
include/linux/sched.h | 1 +
kernel/sys.c | 4 +++-
security/commoncap.c | 3 ++-
5 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 3625464..ce0e477 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1281,7 +1281,8 @@ int prepare_binprm(struct linux_binprm *bprm)
bprm->cred->euid = current_euid();
bprm->cred->egid = current_egid();

- if (!(bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)) {
+ if (!(bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID) &&
+ !current->always_unprivileged) {
/* Set-uid? */
if (mode & S_ISUID) {
bprm->per_clear |= PER_CLEAR_ON_SETID;
diff --git a/include/linux/prctl.h b/include/linux/prctl.h
index a3baeb2..d5d6ab6 100644
--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -102,4 +102,10 @@

#define PR_MCE_KILL_GET 34

+/*
+ * Set this to ensure that a process and any of its descendents may never
+ * escalate privileges (only reduce them).
+ */
+#define PR_SET_ALWAYS_UNPRIVILEGED 35
+
#endif /* _LINUX_PRCTL_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1c4f3e9..2d6af15 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1402,6 +1402,7 @@ struct task_struct {
unsigned int sessionid;
#endif
seccomp_t seccomp;
+ int always_unprivileged;

/* Thread group tracking */
u32 parent_exec_id;
diff --git a/kernel/sys.c b/kernel/sys.c
index 481611f..fbb6248 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1776,7 +1776,6 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_SET_ENDIAN:
error = SET_ENDIAN(me, arg2);
break;
-
case PR_GET_SECCOMP:
error = prctl_get_seccomp();
break;
@@ -1841,6 +1840,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
else
error = PR_MCE_KILL_DEFAULT;
break;
+ case PR_SET_ALWAYS_UNPRIVILEGED:

+ current->always_unprivileged = 1;

+ break;
default:
error = -EINVAL;
break;
diff --git a/security/commoncap.c b/security/commoncap.c
index ee4f848..ca94952 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -439,7 +439,8 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
if (!file_caps_enabled)
return 0;

- if (bprm->file->f_vfsmnt->mnt_flags & MNT_NOSUID)
+ if ((bprm->file->f_vfsmnt->mnt_flags & MNT_NOSUID) ||
+ current->always_unprivileged)
return 0;

dentry = dget(bprm->file->f_dentry);

Linus Torvalds

unread,

Jan 12, 2012, 6:50:02 PM1/12/12

to

On Thu, Jan 12, 2012 at 3:38 PM, Will Drewry <w...@chromium.org> wrote:
> This patch is a placeholder until Andy's (lu...@mit.edu) patch arrives
> implementing Linus's proposal for applying a "this is a process that has
> *no* extra privileges at all, and can never get them".

I think we can simplify and improve the naming/logic by just saying
"can't change privileges".

I'd argue that that even includes "can't drop them", just to make it
really clear what the rules are.

So the usage model would be to first simply set the privileges to
whatever you want the sandbox to be, and then enter the restricted
mode.

Linus

Will Drewry

unread,

Jan 12, 2012, 7:00:02 PM1/12/12

to

On Thu, Jan 12, 2012 at 6:51 PM, Randy Dunlap <rdu...@xenotime.net> wrote:

> On 01/12/2012 03:38 PM, Will Drewry wrote:
>> include/linux/prctl.h | 3 +
>> include/linux/seccomp.h | 68 +++++-
>> kernel/Makefile | 1 +
>> kernel/fork.c | 4 +
>> kernel/seccomp.c | 8 +
>> kernel/seccomp_filter.c | 620 +++++++++++++++++++++++++++++++++++++++++++++++
>> kernel/sys.c | 4 +
>> security/Kconfig | 12 +
>> 8 files changed, 717 insertions(+), 3 deletions(-)
>> create mode 100644 kernel/seccomp_filter.c
>>

> (in multiple places:) Kernel multi-line comment style is:
>
> /*
> * first line of text
> * more stuff
> */

Thanks! I'll roll through and clean them all up. My apologies!

>> +static inline void seccomp_struct_free_task(struct seccomp_struct *seccomp)
>> +{
>> + put_seccomp_filter(seccomp->filter);
>> + seccomp->filter = NULL;
>> +}
>> +
>> +#else /* CONFIG_SECCOMP_FILTER */
>> +
>> +#include <linux/errno.h>
>> +
>> +struct seccomp_filter { };
>> +/* Macros consume the unused dereference by the caller. */
>> +#define seccomp_struct_init_task(_seccomp) do { } while (0);
>> +#define seccomp_struct_fork(_tsk, _orig) do { } while (0);
>> +#define seccomp_struct_free_task(_seccomp) do { } while (0);
>> +
>> +static inline long prctl_attach_seccomp_filter(char __user *a2)
>> +{
>> + return -ENOSYS;
>> +}
>> +
>> +#endif /* CONFIG_SECCOMP_FILTER */
>> #endif /* _LINUX_SECCOMP_H */
>
>

>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>> index 57d4b13..78719be 100644
>> --- a/kernel/seccomp.c
>> +++ b/kernel/seccomp.c
>> @@ -47,6 +47,14 @@ void __secure_computing(int this_syscall)
>> return;
>> } while (*++syscall);
>> break;
>> +#ifdef CONFIG_SECCOMP_FILTER
>> + case 2:
>

> Can we get macros (defines) for these @modes instead of using
> inline constants?

Certainly!

Aside from being a year off, is there a current style? I just went
off of an existing file.

>
>> +/* seccomp_struct_fork: manages inheritance on fork
>

> /**
> * seccomp_struct_fork - manages inheritance on fork

Thanks - sorry!

>> + * @child: forkee's seccomp_struct
>> + * @parent: forker's seccomp_struct
>> + * Ensures that @child inherit a seccomp_filter iff seccomp is enabled
>> + * and the set of filters is marked as 'enabled'.
>> + */
>> +void seccomp_struct_fork(struct seccomp_struct *child,
>> + const struct seccomp_struct *parent)
>> +{
>> + if (!parent->mode)
>> + return;
>> + child->mode = parent->mode;
>> + child->filter = get_seccomp_filter(parent->filter);
>> +}
>> +
>> +/* Returns a pointer to the BPF evaluator after checking the offset and size
>> + * boundaries. The signature almost matches the signature from
>> + * net/core/filter.c with the hopes of sharing code in the future.
>

> Use kernel multi-line comment style.

Of course.

>> + */
>> +static const void *load_pointer(const u8 *buf, size_t buflen,
>> + int offset, size_t size,
>> + void *unused)
>> +{
>> + if (offset >= buflen)
>> + goto fail;
>> + if (offset < 0)
>> + goto fail;
>> + if (size > buflen - offset)
>> + goto fail;
>> + return buf + offset;
>> +fail:
>> + return NULL;
>> +}
>> +
>

>> diff --git a/security/Kconfig b/security/Kconfig
>> index 51bd5a0..77b1106 100644
>> --- a/security/Kconfig
>> +++ b/security/Kconfig
>> @@ -84,6 +84,18 @@ config SECURITY_DMESG_RESTRICT
>>
>> If you are unsure how to answer this question, answer N.
>>
>> +config SECCOMP_FILTER
>> + bool "Enable seccomp-based system call filtering"
>> + select SECCOMP
>> + depends on EXPERIMENTAL
>> + help
>> + This kernel feature expands CONFIG_SECCOMP to allow computing
>> + in environments with reduced kernel access dictated by a system
>> + call filter, expressed in BPF, installed by the application itself
>> + through prctl(2).
>

> This help text is only useful to someone who already knows what it does/means
> IMO.

I'll attempt to clean that up so it makes actual sense to those without context!

>> +
>> + See Documentation/prctl/seccomp_filter.txt for more detail.
>

> Yes, I'll look at that..

Awesome - thanks!

>> +
>> config SECURITY
>> bool "Enable different security models"
>> depends on SYSFS
>
>
> --

> ~Randy
> *** Remember to use Documentation/SubmitChecklist when testing your code ***

Randy Dunlap

unread,

Jan 12, 2012, 7:00:02 PM1/12/12

to

On 01/12/2012 03:38 PM, Will Drewry wrote:

(in multiple places:) Kernel multi-line comment style is:

/*
* first line of text
* more stuff

*/

> +static inline void seccomp_struct_free_task(struct seccomp_struct *seccomp)
> +{
> + put_seccomp_filter(seccomp->filter);
> + seccomp->filter = NULL;
> +}
> +
> +#else /* CONFIG_SECCOMP_FILTER */
> +
> +#include <linux/errno.h>
> +
> +struct seccomp_filter { };
> +/* Macros consume the unused dereference by the caller. */
> +#define seccomp_struct_init_task(_seccomp) do { } while (0);
> +#define seccomp_struct_fork(_tsk, _orig) do { } while (0);
> +#define seccomp_struct_free_task(_seccomp) do { } while (0);
> +
> +static inline long prctl_attach_seccomp_filter(char __user *a2)
> +{
> + return -ENOSYS;
> +}
> +
> +#endif /* CONFIG_SECCOMP_FILTER */
> #endif /* _LINUX_SECCOMP_H */

> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 57d4b13..78719be 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -47,6 +47,14 @@ void __secure_computing(int this_syscall)
> return;
> } while (*++syscall);
> break;
> +#ifdef CONFIG_SECCOMP_FILTER
> + case 2:

Can we get macros (defines) for these @modes instead of using
inline constants?

...

> +/* seccomp_struct_fork: manages inheritance on fork

/**
* seccomp_struct_fork - manages inheritance on fork

> + * @child: forkee's seccomp_struct
> + * @parent: forker's seccomp_struct
> + * Ensures that @child inherit a seccomp_filter iff seccomp is enabled
> + * and the set of filters is marked as 'enabled'.
> + */
> +void seccomp_struct_fork(struct seccomp_struct *child,
> + const struct seccomp_struct *parent)
> +{
> + if (!parent->mode)
> + return;
> + child->mode = parent->mode;
> + child->filter = get_seccomp_filter(parent->filter);
> +}
> +
> +/* Returns a pointer to the BPF evaluator after checking the offset and size
> + * boundaries. The signature almost matches the signature from
> + * net/core/filter.c with the hopes of sharing code in the future.

Use kernel multi-line comment style.

> + */
> +static const void *load_pointer(const u8 *buf, size_t buflen,
> + int offset, size_t size,
> + void *unused)
> +{
> + if (offset >= buflen)
> + goto fail;
> + if (offset < 0)
> + goto fail;
> + if (size > buflen - offset)
> + goto fail;
> + return buf + offset;
> +fail:
> + return NULL;
> +}
> +

> diff --git a/security/Kconfig b/security/Kconfig
> index 51bd5a0..77b1106 100644
> --- a/security/Kconfig
> +++ b/security/Kconfig
> @@ -84,6 +84,18 @@ config SECURITY_DMESG_RESTRICT
>
> If you are unsure how to answer this question, answer N.
>
> +config SECCOMP_FILTER
> + bool "Enable seccomp-based system call filtering"
> + select SECCOMP
> + depends on EXPERIMENTAL
> + help
> + This kernel feature expands CONFIG_SECCOMP to allow computing
> + in environments with reduced kernel access dictated by a system
> + call filter, expressed in BPF, installed by the application itself
> + through prctl(2).

This help text is only useful to someone who already knows what it does/means
IMO.

> +
> + See Documentation/prctl/seccomp_filter.txt for more detail.

Yes, I'll look at that..

> +
> config SECURITY
> bool "Enable different security models"
> depends on SYSFS

--

~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

Will Drewry

unread,

Jan 12, 2012, 7:10:02 PM1/12/12

to

On Thu, Jan 12, 2012 at 5:47 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
> On Thu, Jan 12, 2012 at 3:38 PM, Will Drewry <w...@chromium.org> wrote:
>> This patch is a placeholder until Andy's (lu...@mit.edu) patch arrives
>> implementing Linus's proposal for applying a "this is a process that has
>> *no* extra privileges at all, and can never get them".
>
> I think we can simplify and improve the naming/logic by just saying
> "can't change privileges".
>
> I'd argue that that even includes "can't drop them", just to make it
> really clear what the rules are.
>
> So the usage model would be to first simply set the privileges to
> whatever you want the sandbox to be, and then enter the restricted
> mode.

That sounds ideal to me. This placeholder is certainly insufficient
then. I'll keep tweaking this patch then until its successor emerges.

Thanks!
will

Randy Dunlap

unread,

Jan 12, 2012, 7:40:02 PM1/12/12

to

Sorry, I wasn't making a comment about that header. Guess I should
have deleted it.

thanks,

Andrew Lutomirski

unread,

Jan 12, 2012, 7:50:02 PM1/12/12

to

On Thu, Jan 12, 2012 at 3:47 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
> On Thu, Jan 12, 2012 at 3:38 PM, Will Drewry <w...@chromium.org> wrote:
>> This patch is a placeholder until Andy's (lu...@mit.edu) patch arrives
>> implementing Linus's proposal for applying a "this is a process that has
>> *no* extra privileges at all, and can never get them".
>
> I think we can simplify and improve the naming/logic by just saying
> "can't change privileges".
>
> I'd argue that that even includes "can't drop them", just to make it
> really clear what the rules are.

That may prevent another use: set this new flag, chroot, drop
privileges, accept network connections. (The idea being that chroot
might work unprivileged if this flag is set.)

--Andy

Linus Torvalds

unread,

Jan 12, 2012, 8:00:01 PM1/12/12

to

On Thu, Jan 12, 2012 at 4:42 PM, Andrew Lutomirski <lu...@mit.edu> wrote:
>
> That may prevent another use: set this new flag, chroot, drop
> privileges, accept network connections. (The idea being that chroot
> might work unprivileged if this flag is set.)

Well, if you have privileges, then just do

chroot();
drop privileges

and if you depend on the new flag, then you do

drop privileges
set new flag
chroot

and if you want to work either way then you just do

error = chroot
drop privileges
set new flag
if error
chroot

which does the right thing regardless of whether you had privileges
and/or a new kernel or not.

In any of the three cases I don't see why you'd ever want to drop
privileges *after* setting the new flag.

Linus

Linus Torvalds

unread,

Jan 12, 2012, 8:20:02 PM1/12/12

to

On Thu, Jan 12, 2012 at 5:11 PM, Andrew Lutomirski <lu...@mit.edu> wrote:
>
> What if you're a daemon that needs something like CAP_NET_BIND but
> also wants to be able to run other helpers without CAP_NET_BIND?
>
> (Also, preventing dropping of privileges will probably make a patch
> more complicted -- I'll have to find and update all the places that
> allow dropping privileges.)

Hey, if it actually makes it more complicated to say "don't change
privileges", then I guess my argument that it should be simpler is
wrong.

That said, the thing you bring up is *not* the actual use-case for the
suggestion. The use-case is a "run untrusted code". So the use-case
would be to set the flag after you've dropped CAP_NET_BIND, and
*before* you actually run the other helpers. You clearly must have a
fork() or something like that there, since you want to keep the
NET_BIND in the original daemon.

So I don't t think your example is actually the expected situation.
The obvious approach for your example is
- run deamon with CAP_NET_BIND
- per connection, fork, drop privileges in child, then set
"restricted" flag, and run the untrusted code.

(where the restricted mode setting may well obviously also do other
things - like limit allowed system calls etc)

Andrew Lutomirski

unread,

Jan 12, 2012, 8:20:02 PM1/12/12

to

On Thu, Jan 12, 2012 at 4:57 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
> On Thu, Jan 12, 2012 at 4:42 PM, Andrew Lutomirski <lu...@mit.edu> wrote:
>>
>> That may prevent another use: set this new flag, chroot, drop
>> privileges, accept network connections. (The idea being that chroot
>> might work unprivileged if this flag is set.)
>
> Well, if you have privileges, then just do
>
> chroot();
> drop privileges
>
> and if you depend on the new flag, then you do
>
> drop privileges
> set new flag
> chroot
>
> and if you want to work either way then you just do
>
> error = chroot
> drop privileges
> set new flag
> if error
> chroot
>
> which does the right thing regardless of whether you had privileges
> and/or a new kernel or not.
>
> In any of the three cases I don't see why you'd ever want to drop
> privileges *after* setting the new flag.

Hmm...

What if you're a daemon that needs something like CAP_NET_BIND but
also wants to be able to run other helpers without CAP_NET_BIND?

(Also, preventing dropping of privileges will probably make a patch
more complicted -- I'll have to find and update all the places that
allow dropping privileges.)

--Andy

Will Drewry

unread,

Jan 12, 2012, 8:40:02 PM1/12/12

to

On Thu, Jan 12, 2012 at 7:11 PM, Andrew Lutomirski <lu...@mit.edu> wrote:
> (Also, preventing dropping of privileges will probably make a patch
> more complicted -- I'll have to find and update all the places that
> allow dropping privileges.)

An alternative approach might be that the restricted bit drops all
privileges that allows privilege changes in either direction. E.g.,
- set restricted bit
-- adds a check anywhere MNT_NOSUID is
-- sets securebit to SECURE_NOROOT|..LOCKED
-- drops CAP_SETUID, CAP_DAC_OVERRIDE, ...
-- set the caps bounding set to the minimum the restricted bit allows

That may deviate from the intent (by re-using caps), but it could keep some
of the privilege transition checking code the same.

Just a thought,
will

Jamie Lokier

unread,

Jan 12, 2012, 8:40:02 PM1/12/12

to

Will Drewry wrote:
> >> > There's already a security model around who can use ptrace(); speeding
> >> > it up needn't break that.
> >> >
> >> > If we'd had BPF ptrace in the first place, SECCOMP wouldn't have been
> >> > needed as userspace could have done it, with exactly the restrictions
> >> > it wants. Google's NaCl comes to mind as a potential user.
> >>
> >> That's not entirely true. ptrace supervisors are subject to races and
> >> always fail open. This makes them effective but not as robust as a
> >> seccomp solution can provide.
> >
> > What races do you know about?
>
> I'm pretty sure that if you have two "isolated" processes, they could
> cause irregular behavior using signals.

Do you have an example? I'm not aware of one and I've been studying
ptrace quite a bit lately. If there's a race (other than temporary
kernel bugs with all the ptrace patching lately ;-), I would like to
know and maybe patch it.

The only signal confusion when ptracing syscalls I'm aware of is with
SIGTRAP, and that was fixed in 2.5.46, long, long ago (PTRACE_SETOPTIONS).

> > I'm not aware of any ptrace races if it's used properly. I'm also not
> > sure what you mean by fail open/close here, unless you mean the target
> > process gets to carry on if the tracing process dies.
>
> Exactly. Security systems that, on failure, allow the action to
> proceed can't be relied on.

That's fair enough. There are numerous occasions when ptracer death
should kill the tracee anyway regardless of security. E.g. "strace
command..." and strace dies, you'd normally want the command to
be killed as well. So that could be worth a ptrace option anyway.

Thanks,
-- Jamie

James Morris

unread,

Jan 12, 2012, 8:40:02 PM1/12/12

to

On Thu, 12 Jan 2012, Alan Cox wrote:

> In general I like this approach. It's simple, it's compact and it offers
> interesting possibilities for solving some interesting problem spaces,
> without the full weight of SELinux, SMACK etc which are still needed for
> heavyweight security.

Yes, I can see potential to vastly simplify MAC policy in some cases.

- James
--
James Morris
<jmo...@namei.org>

Andrew Lutomirski

unread,

Jan 12, 2012, 8:50:02 PM1/12/12

to

On Thu, Jan 12, 2012 at 5:37 PM, Will Drewry <w...@chromium.org> wrote:
> On Thu, Jan 12, 2012 at 7:11 PM, Andrew Lutomirski <lu...@mit.edu> wrote:
>> (Also, preventing dropping of privileges will probably make a patch
>> more complicted -- I'll have to find and update all the places that
>> allow dropping privileges.)
>
> An alternative approach might be that the restricted bit drops all
> privileges that allows privilege changes in either direction. E.g.,
> - set restricted bit
> -- adds a check anywhere MNT_NOSUID is
> -- sets securebit to SECURE_NOROOT|..LOCKED
> -- drops CAP_SETUID, CAP_DAC_OVERRIDE, ...
> -- set the caps bounding set to the minimum the restricted bit allows
>
> That may deviate from the intent (by re-using caps), but it could keep some
> of the privilege transition checking code the same.

I'm not sure it'll be much of a simplification. The entire patch is
45 lines right now :) I'll test it and send it out.

FWIW, though, it breaks apparmor (intentionally). Can any of you
either explain what *should* happen or (better) volunteer to fix it?
It should be about three lines of code for someone who understands
what's going on. I don't have an apparmor system, so I can't really
test it.

--Andy

Kees Cook

unread,

Jan 12, 2012, 9:20:02 PM1/12/12

to

On Thu, Jan 12, 2012 at 5:41 PM, Andrew Lutomirski <lu...@mit.edu> wrote:
> On Thu, Jan 12, 2012 at 5:37 PM, Will Drewry <w...@chromium.org> wrote:
>> On Thu, Jan 12, 2012 at 7:11 PM, Andrew Lutomirski <lu...@mit.edu> wrote:
>>> (Also, preventing dropping of privileges will probably make a patch
>>> more complicted -- I'll have to find and update all the places that
>>> allow dropping privileges.)
>>
>> An alternative approach might be that the restricted bit drops all
>> privileges that allows privilege changes in either direction. E.g.,
>> - set restricted bit
>> -- adds a check anywhere MNT_NOSUID is
>> -- sets securebit to SECURE_NOROOT|..LOCKED
>> -- drops CAP_SETUID, CAP_DAC_OVERRIDE, ...
>> -- set the caps bounding set to the minimum the restricted bit allows
>>
>> That may deviate from the intent (by re-using caps), but it could keep some
>> of the privilege transition checking code the same.
>
> I'm not sure it'll be much of a simplification. The entire patch is
> 45 lines right now :) I'll test it and send it out.
>
> FWIW, though, it breaks apparmor (intentionally). Can any of you
> either explain what *should* happen or (better) volunteer to fix it?
> It should be about three lines of code for someone who understands
> what's going on. I don't have an apparmor system, so I can't really
> test it.

I'm happy to take a look at the AppArmor breakage. What's happening
with it at the moment?

--
Kees Cook
ChromeOS Security

Indan Zupancic

unread,

Jan 12, 2012, 9:50:02 PM1/12/12

to

Hello,

I think execve should be allowed and follow the same rules as execve under ptrace.

On Thu, January 12, 2012 18:57, Jamie Lokier wrote:
> Will Drewry wrote:
>> On Thu, Jan 12, 2012 at 11:22 AM, Jamie Lokier <ja...@shareable.org> wrote:
>> > Will Drewry wrote:
>> >> On Thu, Jan 12, 2012 at 9:43 AM, Steven Rostedt <ros...@goodmis.org> wrote:
>> >> > On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
>> >> >
>> >> >> Filter programs may _only_ cross the execve(2) barrier if last filter
>> >> >> program was attached by a task with CAP_SYS_ADMIN capabilities in its
>> >> >> user namespace. �Once a task-local filter program is attached from a
>> >> >> process without privileges, execve will fail. �This ensures that only
>> >> >> privileged parent task can affect its privileged children (e.g., setuid
>> >> >> binary).
>> >> >
>> >> > This means that a non privileged user can not run another program with
>> >> > limited features? How would a process exec another program and filter
>> >> > it? I would assume that the filter would need to be attached first and
>> >> > then the execv() would be performed. But after the filter is attached,
>> >> > the execv is prevented?
>> >>
>> >> Yeah - it means tasks can filter themselves, but not each other.
>> >> However, you can inject a filter for any dynamically linked executable
>> >> using LD_PRELOAD.
>> >>
>> >> > Maybe I don't understand this correctly.
>> >>
>> >> You're right on. �This was to ensure that one process didn't cause
>> >> crazy behavior in another. I think Alan has a better proposal than
>> >> mine below. �(Goes back to catching up.)
>> >
>> > You can already use ptrace() to cause crazy behaviour in another
>> > process, including modifying registers arbitrarily at syscall entry
>> > and exit, aborting and emulating syscalls.
>> >
>> > ptrace() is quite slow and it would be really nice to speed it up,
>> > especially for trapping a small subset of syscalls, or limiting some
>> > kinds of access to some file descriptors, while everything else runs
>> > at normal speed.
>> >
>> > Speeding up ptrace() with BPF filters would be a really nice. �Not
>> > that I like ptrace(), but sometimes it's the only thing you can rely on.
>> >
>> > LD_PRELOAD and code running in the target process address space can't
>> > always be trusted in some contexts (e.g. the target process may modify
>> > the tracing code or its data); whereas ptrace() is pretty complete and
>> > reliable, if ugly.

>> >
>> > There's already a security model around who can use ptrace(); speeding
>> > it up needn't break that.
>> >
>> > If we'd had BPF ptrace in the first place, SECCOMP wouldn't have been
>> > needed as userspace could have done it, with exactly the restrictions
>> > it wants. �Google's NaCl comes to mind as a potential user.
>>
>> That's not entirely true. ptrace supervisors are subject to races and
>> always fail open. This makes them effective but not as robust as a
>> seccomp solution can provide.
>
> What races do you know about?
>

> I'm not aware of any ptrace races if it's used properly. I'm also not
> sure what you mean by fail open/close here, unless you mean the target
> process gets to carry on if the tracing process dies.

That one could be easily fixed with a new ptrace option.

The tracer can kill all traced tasks before it dies except when it exits
with a SIGKILL. In that case another observer task could kill all the
traced tasks, but that is just moving the problem around.

> Having said that, I can think of one race, but I think your BPF scheme
> has the same one: After checking the syscall's string arguments and
> other pointed to data, another thread can change those arguments
> before the real syscall uses them.

I have implemented a ptrace based jailer which avoids these kinds of
races by copying such strings to read-only memory before the system call
is allowed to proceed. Only races that can't be closed with ptrace are
symlink races, and then only with an attacker outside the jail.

And the architectural differences in registers are easily abstracted away
when you're only interested in system call arguments and the instruction
pointer. The system call table information is more annoying, but unavoidable.

Our jailer is around 5k lines of code and supports checking file paths, PIDs,
FDs, SYSV IPC and has limited networking support (no incoming peer address
filtering), all race free. The idea is transparent jailing of complex tasks
with minimal configuration (everything is contained within the jail, access
to anything else needs explicit permission). It's more or less finished for
a few years now, but everyone is busy with other things and no one got around
releasing the code. :-/

It would be nice to avoid the ptrace overhead for system calls that are always
allowed or always denied, so I hope this BPF filtering can be made to work in
conjunction with ptrace so that the tracer only has to handle system calls not
handled by the BPF filter. One way to achieve that is to have a way for the BPF
filter to let a system call generate ptrace system call events or not, with a
new ptrace option PTRACE_UNHANDLED_SYSCALL or something like that to ask for
the unhandled system calls events.

Greetings,

Indan

Will Drewry

unread,

Jan 12, 2012, 10:40:01 PM1/12/12

to

[This patch depends on lu...@mit.edu's no_new_privs patch:
https://lkml.org/lkml/2012/1/12/446
]

This patch adds support for seccomp mode 2. This mode enables dynamic
enforcement of system call filtering policy in the kernel as specified

by a userland task. The policy is expressed in terms of a Berkeley
Packet Filter program, as is used for userland-exposed socket filtering.

Instead of network data, the BPF program is evaluated over struct
user_regs_struct at the time of the system call (as retrieved using
regviews).

A filter program may be installed by a userland task by calling
prctl(PR_ATTACH_SECCOMP_FILTER, &fprog);
where fprog is of type struct sock_fprog.

If the first filter program allows subsequent prctl(2) calls, then
additional filter programs may be attached. All attached programs
must be evaluated before a system call will be allowed to proceed.

To avoid CONFIG_COMPAT related landmines, once a filter program is
installed using specific is_compat_task() and current->personality, it
is not allowed to make system calls or attach additional filters which
use a different combination of is_compat_task() and
current->personality.

Filter programs will be inherited across fork/clone and execve.
However, if the task attaching the filter is unprivileged
(!CAP_SYS_ADMIN) the no_new_privs bit will be set on the task. This
ensures that unprivileged tasks cannot attach filters that affect
privileged tasks (e.g., setuid binary).

There are a number of benefits to this approach. A few of which are
as follows:
- BPF has been exposed to userland for a long time.
- Userland already knows its ABI: expected register layout and system
call numbers.
- Full register information is provided which may be relevant for
certain syscalls (fork, rt_sigreturn) or for other userland
filtering tactics (checking the PC).
- No time-of-check-time-of-use vulnerable data accesses are possible.

This patch includes its own BPF evaluator, but relies on the
net/core/filter.c BPF checking code. It is possible to share
evaluators, but the performance sensitive nature of the network
filtering path makes it an iterative optimization which (I think :) can
be tackled separately via separate patchsets. (And at some point sharing
BPF JIT code!)

v4: - adjusted prctl to make room for PR_[SG]ET_NO_NEW_PRIVS
- now uses current->no_new_privs
(lu...@mit.edu,torv...@linux-foundation.com)
- assign names to seccomp modes (rdu...@xenotime.net)
- fix style issues (rdu...@xenotime.net)
- reworded Kconfig entry (rdu...@xenotime.net)

v3: - macros to inline (ol...@redhat.com)
- init_task behavior fixed (ol...@redhat.com)
- drop creator entry and extra NULL check (ol...@redhat.com)
- alloc returns -EINVAL on bad sizing (serge....@canonical.com)

- adds tentative use of "always_unprivileged" as per
torv...@linux-foundation.org and lu...@mit.edu
v2: - (patch 2 only)

Signed-off-by: Will Drewry <w...@chromium.org>
---

include/linux/prctl.h | 3 +
include/linux/seccomp.h | 70 +++++-

kernel/Makefile | 1 +
kernel/fork.c | 4 +

kernel/seccomp.c | 10 +-
kernel/seccomp_filter.c | 642 +++++++++++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 4 +
security/Kconfig | 16 ++
8 files changed, 746 insertions(+), 4 deletions(-)
create mode 100644 kernel/seccomp_filter.c

diff --git a/include/linux/prctl.h b/include/linux/prctl.h
index a3baeb2..81c6081 100644
--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h

@@ -64,6 +64,9 @@
#define PR_GET_SECCOMP 21
#define PR_SET_SECCOMP 22

+/* Set process seccomp filters */

+#define PR_ATTACH_SECCOMP_FILTER 37

+
/* Get/set the capability bounding set (as per security/commoncap.c) */
#define PR_CAPBSET_READ 23
#define PR_CAPBSET_DROP 24

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index cc7a4e9..d73fc35 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -5,9 +5,30 @@

#ifdef CONFIG_SECCOMP

#include <linux/thread_info.h>
+#include <linux/types.h>
#include <asm/seccomp.h>

-typedef struct { int mode; } seccomp_t;

+/* Valid values of seccomp_struct.mode */
+#define SECCOMP_MODE_DISABLED 0 /* seccomp is not in use. */
+#define SECCOMP_MODE_STRICT 1 /* uses hard-coded seccomp.c rules. */
+#define SECCOMP_MODE_FILTER 2 /* system call access determined by filter. */
+

+struct seccomp_filter;
+/**
+ * struct seccomp_struct - the state of a seccomp'ed process
+ *

+ * @mode: indicates one of the valid values above for controlled
+ * system calls available to a process.

+ * @filter: Metadata for filter if using CONFIG_SECCOMP_FILTER.
+ * @filter must only be accessed from the context of current as there
+ * is no guard.
+ */
+typedef struct seccomp_struct {
+ int mode;
+#ifdef CONFIG_SECCOMP_FILTER
+ struct seccomp_filter *filter;
+#endif
+} seccomp_t;

extern void __secure_computing(int);
static inline void secure_computing(int this_syscall)

@@ -28,8 +49,7 @@ static inline int seccomp_mode(seccomp_t *s)

#include <linux/errno.h>

-typedef struct { } seccomp_t;
-
+typedef struct seccomp_struct { } seccomp_t;
#define secure_computing(x) do { } while (0)

static inline long prctl_get_seccomp(void)

@@ -49,4 +69,48 @@ static inline int seccomp_mode(seccomp_t *s)

#endif /* CONFIG_SECCOMP */

+#ifdef CONFIG_SECCOMP_FILTER
+
+
+extern long prctl_attach_seccomp_filter(char __user *);
+
+extern struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *);
+extern void put_seccomp_filter(struct seccomp_filter *);
+
+extern int seccomp_test_filters(int);
+extern void seccomp_filter_log_failure(int);
+extern void seccomp_struct_fork(struct seccomp_struct *child,
+ const struct seccomp_struct *parent);
+
+static inline void seccomp_struct_init_task(struct seccomp_struct *seccomp)
+{

+ seccomp->mode = SECCOMP_MODE_DISABLED;

+ seccomp->filter = NULL;
+}
+
+/* No locking is needed here because the task_struct will
+ * have no parallel consumers.
+ */

+static inline void seccomp_struct_free_task(struct seccomp_struct *seccomp)
+{
+ put_seccomp_filter(seccomp->filter);
+ seccomp->filter = NULL;
+}
+
+#else /* CONFIG_SECCOMP_FILTER */
+
+#include <linux/errno.h>
+
+struct seccomp_filter { };
+/* Macros consume the unused dereference by the caller. */
+#define seccomp_struct_init_task(_seccomp) do { } while (0);
+#define seccomp_struct_fork(_tsk, _orig) do { } while (0);
+#define seccomp_struct_free_task(_seccomp) do { } while (0);
+
+static inline long prctl_attach_seccomp_filter(char __user *a2)
+{
+ return -ENOSYS;
+}
+
+#endif /* CONFIG_SECCOMP_FILTER */
#endif /* _LINUX_SECCOMP_H */

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 57d4b13..b87c1bc 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -36,7 +36,7 @@ void __secure_computing(int this_syscall)
int * syscall;

switch (mode) {
- case 1:
+ case SECCOMP_MODE_STRICT:
syscall = mode1_syscalls;
#ifdef CONFIG_COMPAT
if (is_compat_task())

@@ -47,6 +47,14 @@ void __secure_computing(int this_syscall)
return;
} while (*++syscall);
break;
+#ifdef CONFIG_SECCOMP_FILTER

+ case SECCOMP_MODE_FILTER:

+ if (seccomp_test_filters(this_syscall) == 0)
+ return;
+
+ seccomp_filter_log_failure(this_syscall);
+ break;
+#endif
default:
BUG();
}

diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
new file mode 100644

index 0000000..b1c3f60
--- /dev/null
+++ b/kernel/seccomp_filter.c
@@ -0,0 +1,642 @@
+/*
+ * linux/kernel/seccomp_filter.c

+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *

+ * Copyright (C) 2012 The Chromium OS Authors <chromiu...@chromium.org>
+ *
+ * Extends linux/kernel/seccomp.c to allow tasks to install system call
+ * filters using a Berkeley Packet Filter program which is executed over
+ * struct user_regs_struct.

+ return NULL;
+}
+

+/**
+ * seccomp_get_regs - returns a pointer to struct user_regs_struct
+ * @scratch: preallocated storage of size @available
+ * @available: pointer to the size of scratch.
+ *
+ * Returns NULL if the registers cannot be acquired or copied.
+ * Returns a populated pointer to @scratch by default.
+ * Otherwise, returns a pointer to a a u8 array containing the struct
+ * user_regs_struct appropriate for the task personality. The pointer
+ * may be to the beginning of @scratch or to an externally managed data
+ * structure. On success, @available should be updated with the
+ * valid region size of the returned pointer.
+ *
+ * If the architecture overrides the linkage, then the pointer may pointer to
+ * another location.
+ */
+__weak u8 *seccomp_get_regs(u8 *scratch, size_t *available)
+{
+ /*

+ *regset is usually returned based on task personality, not current

+ * system call convention. This behavior makes it unsafe to execute
+ * BPF programs over regviews if is_compat_task or the personality
+ * have changed since the program was installed.
+ */
+ const struct user_regset_view *view = task_user_regset_view(current);
+ const struct user_regset *regset = &view->regsets[0];
+ size_t scratch_size = *available;
+ if (regset->core_note_type != NT_PRSTATUS) {
+ /* The architecture should override this method for speed. */
+ regset = find_prstatus(view);
+ if (!regset)

+ return NULL;
+ }

+ * All filters in the list are required to share the same system call

+ * convention so only the first filter is ever checked.
+ */
+ if (seccomp_check_personality(filter))
+ goto out;
+
+ /*

+ * Grab the user_regs_struct. Normally, regs == &regs_tmp, but

+ *

+ * If a process lacks CAP_SYS_ADMIN in its namespace, force
+ * this process and all descendents to run with no_new_privs.

+ * A privileged process will need to set this bit independently,
+ * if desired.
+ */
+ if (security_real_capable_noaudit(current, current_user_ns(),
+ CAP_SYS_ADMIN) != 0)

+ current->no_new_privs = 1;

+
+ /* Copy the instructions from fprog. */
+ ret = -EFAULT;
+ if (copy_from_user(filter->insns, fprog->filter, fp_size))
+ goto out;
+
+ /* Check the fprog */
+ ret = sk_chk_filter(filter->insns, filter->count);
+ if (ret)
+ goto out;
+
+ /*

+ * If there is an existing filter, make it the parent

+ * and reuse the existing task-based ref.
+ */
+ filter->parent = current->seccomp.filter;
+
+ /* Force all filters to use one system call convention. */
+ ret = -EINVAL;
+ if (filter->parent) {
+ if (filter->parent->flags.compat != filter->flags.compat)
+ goto out;
+ if (filter->parent->personality != filter->personality)
+ goto out;
+ }
+
+ /*

+ * Double claim the new filter so we can release it below simplifying

+ * the error paths earlier.
+ */
+ ret = 0;
+ get_seccomp_filter(filter);
+ current->seccomp.filter = filter;
+ /* Engage seccomp if it wasn't. This doesn't use PR_SET_SECCOMP. */

+ if (current->seccomp.mode == SECCOMP_MODE_DISABLED) {
+ current->seccomp.mode = SECCOMP_MODE_FILTER;

+/**
+ * seccomp_struct_fork: manages inheritance on fork

+ * @child: forkee's seccomp_struct
+ * @parent: forker's seccomp_struct
+ *

+ * Ensures that @child inherits seccomp mode and state iff
+ * seccomp filtering is in use.

+ */
+void seccomp_struct_fork(struct seccomp_struct *child,
+ const struct seccomp_struct *parent)
+{

+ child->mode = parent->mode;

+ if (parent->mode != SECCOMP_MODE_FILTER)
+ return;

+ child->filter = get_seccomp_filter(parent->filter);
+}
+

+/**
+ * load_pointer: checks and returns a pointer to the requested offset
+ * @buf: u8 array to index into
+ * @buflen: length of the @buf array
+ * @offset: offset to return data from
+ * @size: size of the data to retrieve at offset
+ * @unused: placeholder which net/core/filter.c uses for for temporary
+ * storage. Ideally, the two code paths can be merged.
+ *
+ * Returns a pointer to the BPF evaluator after checking the offset and size
+ * boundaries.

+ */
+static const void *load_pointer(const u8 *buf, size_t buflen,
+ int offset, size_t size,
+ void *unused)
+{
+ if (offset >= buflen)
+ goto fail;
+ if (offset < 0)
+ goto fail;
+ if (size > buflen - offset)
+ goto fail;
+ return buf + offset;
+fail:
+ return NULL;
+}
+

+ * Note, unlike on network data, values are not

diff --git a/kernel/sys.c b/kernel/sys.c
index 481611f..77f2eda 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1783,6 +1783,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,

case PR_SET_SECCOMP:
error = prctl_set_seccomp(arg2);
break;
+ case PR_ATTACH_SECCOMP_FILTER:
+ error = prctl_attach_seccomp_filter((char __user *)
+ arg2);
+ break;
case PR_GET_TSC:
error = GET_TSC_CTL(arg2);
break;

diff --git a/security/Kconfig b/security/Kconfig
index 51bd5a0..6491bb1 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -84,6 +84,22 @@ config SECURITY_DMESG_RESTRICT

If you are unsure how to answer this question, answer N.

+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ select SECCOMP

+ help
+ This option provide support for limiting the accessibility of
+ systems calls at a task-level using a dynamically defined policy.
+
+ System call filtering policy is expressed by the user using
+ a Berkeley Packet Filter program. The program is attached using
+ prctl(2). For every system call the task makes, its register
+ state will be evaluated by the attached filter program. The
+ result determines if the system call may proceed or if the
+ task should be terminated.

+
+ See Documentation/prctl/seccomp_filter.txt for more detail.

+
config SECURITY
bool "Enable different security models"
depends on SYSFS
--

1.7.5.4

Chris Evans

unread,

Jan 13, 2012, 1:40:02 AM1/13/12

to

Yeah, that's one and it's a pretty awful one when you can consider
that the untrusted tracee can play games such as trying to get the
kernel to fire OOM SIGKILLs.

My memory is hazy but the last time I looked at this in detail there
were other racy areas:

- Bad problems if the tracee takes a SIGTSTP or (real) SIGCONT.
- Difficulty in stopping the syscall from executing once it has
started, especially if the tracer dies.

Cheers
Chris

>
> Having said that, I can think of one race, but I think your BPF scheme
> has the same one: After checking the syscall's string arguments and
> other pointed to data, another thread can change those arguments
> before the real syscall uses them.
>

>> With seccomp, it fails close. What I think would make sense would be
>> to add a user-controllable failure mode with seccomp bpf that calls
>> tracehook_ptrace_syscall_entry(regs). I've prototyped this and it
>> works quite well, but I didn't want to conflate the discussions.
>
> It think it's a nice idea. While you're at it could you fix all the
> architectures to actually use tracehooks for syscall tracing ;-)
>
> (I think it's ok to call the tracehook function on all archs though.)
>
>> Using ptrace() would also mean that all consumers of this interface
>> would need a supervisor, but with seccomp, the filters are installed
>> and require no supervisors to stick around for when failure occurs.
>>
>> Does that make sense?
>
> It does, I agree that ptrace() is quite cumbersome and you don't
> always want a separate tracing process, especially if "failure" means
> to die or get an error.
>
> On the other hand, sometimes when a failure occurs, having another
> process decide what to do, or log the event, is exactly what you want.
>
> For my nefarious purposes I'm really just looking for a faster way to
> reliably trace some activities of individual processes, in particular
> tracking which files they access. I'd rather not interfere with
> debuggers, so I'd really like your ability to stack multiple filters
> to work with separate-process tracing as well. And I'd happily use a
> filter rule which can dump some information over a pipe, without
> waiting for the tracer to respond in most cases.
>
> -- Jamie

Oleg Nesterov

unread,

Jan 13, 2012, 12:40:01 PM1/13/12

to

On 01/12, Will Drewry wrote:
>
> On Thu, Jan 12, 2012 at 11:23 AM, Oleg Nesterov <ol...@redhat.com> wrote:
> > On 01/12, Will Drewry wrote:

> >>
> >> On Thu, Jan 12, 2012 at 10:22 AM, Oleg Nesterov <ol...@redhat.com> wrote:
> >> >> + */
> >> >> + regs = seccomp_get_regs(regs_tmp, &regs_size);
> >> >

> >> > Stupid question. I am sure you know what are you doing ;) and I know
> >> > nothing about !x86 arches.
> >> >
> >> > But could you explain why it is designed to use user_regs_struct ?
> >> > Why we can't simply use task_pt_regs() and avoid the (costly) regsets?
> >>
> >> So on x86 32, it would work since user_regs_struct == task_pt_regs
> >> (iirc), but on x86-64
> >> and others, that's not true.
> >
> > Yes sure, I meant that userpace should use pt_regs too.
> >
> >> If it would be appropriate to expose pt_regs to userspace, then I'd
> >> happily do so :)
> >
> > Ah, so that was the reason. But it is already exported? At least I see
> > the "#ifndef __KERNEL__" definition in arch/x86/include/asm/ptrace.h.
> >
> > Once again, I am not arguing, just trying to understand. And I do not
> > know if this definition is part of abi.
>
> I don't either :/ My original idea was to operate on task_pt_regs(current),
> but I noticed that PTRACE_GETREGS/SETREGS only uses the
> user_regs_struct. So I went that route.

Well, I don't know where user_regs_struct come from initially. But
probably it is needed to allow to access the "artificial" things like
fs_base. Or perhaps this struct mimics the layout in the coredump.

> I'd love for pt_regs to be fair game to cut down on the copying!

Me too. I see no point in using user_regs_struct.

Oleg.

Eric Paris

unread,

Jan 13, 2012, 12:50:02 PM1/13/12

to

On Thu, 2012-01-12 at 17:38 -0600, Will Drewry wrote:

> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index cc7a4e9..0296871 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h

> -typedef struct { int mode; } seccomp_t;
> +struct seccomp_filter;
> +/**
> + * struct seccomp_struct - the state of a seccomp'ed process
> + *
> + * @mode:
> + * if this is 0, seccomp is not in use.
> + * is 1, the process is under standard seccomp rules.
> + * is 2, the process is only allowed to make system calls where
> + * associated filters evaluate successfully.
> + * @filter: Metadata for filter if using CONFIG_SECCOMP_FILTER.
> + * @filter must only be accessed from the context of current as there
> + * is no guard.
> + */
> +typedef struct seccomp_struct {
> + int mode;
> +#ifdef CONFIG_SECCOMP_FILTER
> + struct seccomp_filter *filter;
> +#endif
> +} seccomp_t;
>
> extern void __secure_computing(int);
> static inline void secure_computing(int this_syscall)

Can we get rid of all of the typedef stuff? I know you didn't add it
but now seems like a good time to follow typical kernel semantics if you
have to re-rev for some other reason.

-Eric

Will Drewry

unread,

Jan 13, 2012, 2:00:01 PM1/13/12

to

Yup - I was hoping to do that separately since it touches extra files.
I'll make a prereq patch that enacts the change (so it can be picked
up even if the seccomp-bpf is less successful).

cheers!
will

Will Drewry

unread,

Jan 13, 2012, 2:10:02 PM1/13/12

to

Not sure - added Roland whose name was on many of the files :)

I just noticed that ptrace ABI allows pt_regs access using the register
macros (PTRACE_PEEKUSR) and user_regs_struct access (PTRACE_GETREGS).

But I think the latter is guaranteed to have a certain layout while the macros
for PEEKUSR can do post-processing fixup. (Which could be done in the
bpf evaluator load_pointer() helper if needed.)

>> I'd love for pt_regs to be fair game to cut down on the copying!
>
> Me too. I see no point in using user_regs_struct.

I'll rev the change to use pt_regs and drop all the helper code. If
no one says otherwise, that certainly seems ideal from a performance
perspective, and I see pt_regs exported to userland along with ptrace
abi register offset macros.

Thanks!
will

Will Drewry

unread,

Jan 13, 2012, 6:20:01 PM1/13/12

to

On second thought, pt_regs is scary :)

From looking at
http://lxr.linux.no/linux+v3.2.1/arch/x86/include/asm/syscall.h#L97
and ia32syscall enty code, it appears that for x86, at least, the
pt_regs for compat processes will be 8 bytes wide per register on the
stack. This means if a self-filtering 32-bit program runs on a 64-bit host in
IA32_EMU, its filters will always index into pt_regs incorrectly.

I'm not 100% that I am reading the code right, but it means that I can either
keep using user_regs_struct or fork the code behavior based on compat. That
would need to be arch dependent then which is pretty rough.

Any thoughts?

I'll do a v5 rev for Eric's comments soon, but I'm not quite sure
about the pt_regs
change yet. If the performance boost is worth the effort of having a
per-arch fixup,
I can go that route. Otherwise, I could look at some alternate approach for a
faster-than-regview payload.

Thanks!

Will Drewry

unread,

Jan 13, 2012, 6:20:01 PM1/13/12

to

Ugh. Sorry about the formatting. (The other option is to disallow compat ;).

cheers!
will

Eric Paris

unread,

Jan 13, 2012, 6:40:02 PM1/13/12

to

For anyone who is interested I hacked up a program to turn what I think
is a readable seccomp syntax into BPF rules. It should make it easier
to prototype this new thing. The translator needs a LOT of love to be
worth much, but for now it can handle a couple of things and can build a
set of rules!

The rules are of the form:
label object:
value label

So using Will's BPF example code in my syntax looks like:

start syscall:
rt_sigreturn success
sigreturn success
exit_group success
exit success
read read
write write
read arg0:
0 success
write arg0:
1 success
2 success

So this says the first label is "start" and it is going to deal with the
syscall number. The first value is 'rt_sigreturn' and if syscall ==
rt_sigreturn will cause you to jump to 'success' (success and fail are
implied labels). If the syscall is 'write' we will jump to 'write.'
The write rules look at arg0. If arg0 == "1" we jump to "success". If
you run that syntax through my translator you should get Will's BPF
rules!

You'll quickly notice that the translator only understands "syscall" and
"arg0" and only x86_32, but it should be easy to add more, support the
right registers on different arches, etc, etc. If others think they
might want to hack on the translator I put it at:

http://git.infradead.org/users/eparis/bpf-translate.git

-Eric

translate.py

Jamie Lokier

unread,

Jan 14, 2012, 8:40:02 AM1/14/12

to

Linus Torvalds wrote:
> On Thu, Jan 12, 2012 at 5:11 PM, Andrew Lutomirski <lu...@mit.edu> wrote:
> >
> > What if you're a daemon that needs something like CAP_NET_BIND but
> > also wants to be able to run other helpers without CAP_NET_BIND?
> >
> > (Also, preventing dropping of privileges will probably make a patch
> > more complicted -- I'll have to find and update all the places that
> > allow dropping privileges.)
>
> Hey, if it actually makes it more complicated to say "don't change
> privileges", then I guess my argument that it should be simpler is
> wrong.
>
> That said, the thing you bring up is *not* the actual use-case for the
> suggestion. The use-case is a "run untrusted code". So the use-case
> would be to set the flag after you've dropped CAP_NET_BIND, and
> *before* you actually run the other helpers. You clearly must have a
> fork() or something like that there, since you want to keep the
> NET_BIND in the original daemon.

Well suppose you don't trust the daemon either. It might be running
in a network namespace where it's safe for untrusted code to bind to
low ports.

Or maybe you just need to let it bind willy-nilly among a restricted
subset of low ports - which of course you would like to restrict with
the seccomp filter.

(This can't happen right now because the filter can only look at
arguments, not memory pointed to - so it can't look at the port
number. Can it even see when sys_bind is called on archs like x86
that use sys_socketcall?!)

Anyway the principle is there - CAP_NET_BIND doesn't necessarily mean
the daemon code is trusted.

-- Jamie

Will Drewry

unread,

Jan 14, 2012, 2:30:02 PM1/14/12

to

On Sat, Jan 14, 2012 at 7:30 AM, Jamie Lokier <ja...@shareable.org> wrote:
> Linus Torvalds wrote:
>> On Thu, Jan 12, 2012 at 5:11 PM, Andrew Lutomirski <lu...@mit.edu> wrote:
>> >
>> > What if you're a daemon that needs something like CAP_NET_BIND but
>> > also wants to be able to run other helpers without CAP_NET_BIND?
>> >
>> > (Also, preventing dropping of privileges will probably make a patch
>> > more complicted -- I'll have to find and update all the places that
>> > allow dropping privileges.)
>>
>> Hey, if it actually makes it more complicated to say "don't change
>> privileges", then I guess my argument that it should be simpler is
>> wrong.
>>
>> That said, the thing you bring up is *not* the actual use-case for the
>> suggestion. The use-case is a "run untrusted code". So the use-case
>> would be to set the flag after you've dropped CAP_NET_BIND, and
>> *before* you actually run the other helpers. You clearly must have a
>> fork() or something like that there, since you want to keep the
>> NET_BIND in the original daemon.
>
> Well suppose you don't trust the daemon either. It might be running
> in a network namespace where it's safe for untrusted code to bind to
> low ports.
>
> Or maybe you just need to let it bind willy-nilly among a restricted
> subset of low ports - which of course you would like to restrict with
> the seccomp filter.

Unless the port values are the register arguments, seccomp filter
won't help. It can be used to incrementally drop available system
calls (like socketcall(SYS_LISTEN) or whatever).

> (This can't happen right now because the filter can only look at
> arguments, not memory pointed to - so it can't look at the port
> number. Can it even see when sys_bind is called on archs like x86
> that use sys_socketcall?!)

Yeah - multiplexed system calls like ipc and socketcall can be filtered
based on the argument value in the register. (socketcall's first argument is
"call".)

> Anyway the principle is there - CAP_NET_BIND doesn't necessarily mean
> the daemon code is trusted.

I think we're comparing apples to oranges. I believe the current proposal is a
bit that says "hey! I'm sandboxed!". Defensive programming that is often
achieved through continued reduction of capabilities is important, but
orthogonal. In that model, only once the last vestige of "privilege" is dropped
would the process set the no_new_privs bit. Until then, you rely on the
other access contol pieces you've put in place: namespacing, etc.

While I am a fan of capabilities systems, it would be very cool to have a
bottom floor, privilege-freezer which could help against some classes of
sandbox escapes.

cheers!
will

Linus Torvalds

unread,

Jan 14, 2012, 3:30:01 PM1/14/12

to

On Sat, Jan 14, 2012 at 5:30 AM, Jamie Lokier <ja...@shareable.org> wrote:
>
> Anyway the principle is there - CAP_NET_BIND doesn't necessarily mean
> the daemon code is trusted.

I really think all these arguments are *COMPLETELY* missing the point.

You don't have to use the new flag if you don't want to. Just let it go.

The point of the flag is to not allow security changes. It's that
easy. If you want something else, don't use it.

Because even "dropping capabilities" is quite often the wrong thing to
do. To one person it's "dropping capabilities", to another it is "no
longer running with the capabilities I need".

We've had security bugs that were *due* to dropped capabilities -
people dropped one capability but not another, and fooled code into
doing things they weren't expecting it to do. So quite frankly, I
believe that from a security standpoint it's a hell of a lot safer to
just keep the rules really simple.

Think of the "restrict" bit as "my universe will now run with this
*known* set of permissions".

And if you don't like it, don't use it. It really is that simple. But
don't make the model more complicated. Don't confuse it with "but but
capabilites" crap. The point of the patch is that it makes all of that
go away. There are no capabilities. There only is what you can do.

And yes, I really seriously do believe that is both safer and simpler
than some model that says "you can drop stuff", and then you have to
start making up rules for what "dropping" means.

Does "dropping" mean allowing setuid(geteuid()) for example? That *is*
dropping the uid in a _POSIX_SAVED_IDS environment. And I'm saying
that no, we should not even allow that. It's simply all too "subtle".

(I don't think Andrew's patch actually touched any of those paths, but
I didn't check)

Linus

Andrew Lutomirski

unread,

Jan 14, 2012, 4:10:02 PM1/14/12

to

On Sat, Jan 14, 2012 at 12:22 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
>
> We've had security bugs that were *due* to dropped capabilities -
> people dropped one capability but not another, and fooled code into
> doing things they weren't expecting it to do. So quite frankly, I
> believe that from a security standpoint it's a hell of a lot safer to
> just keep the rules really simple.
>

Which is the point of this patch. If suid/sgid/file caps/LSM
transitions are disabled, you can't exploit thinks like sendmail
because they don't have any special privileges to exploit.

>
> Does "dropping" mean allowing setuid(geteuid()) for example? That *is*
> dropping the uid in a _POSIX_SAVED_IDS environment. And I'm saying
> that no, we should not even allow that. It's simply all too "subtle".
>
> (I don't think Andrew's patch actually touched any of those paths, but
> I didn't check)

Correct. I did not touch any of these paths. I think that's the
right choice for at least three reasons:

1. If there's an exploit that does PR_SET_NO_NEW_CAPS, then
setuid/capset/whatever, then runs something and exploits it, it is
likely to work just as will if setuid/capset is called before
PR_SET_NO_NEW_CAPS.

2. Even if I made that change, it wouldn't help. You can always
simulate dropping capabilities with LD_PRELOAD or something like
seccomp mode 2. The secure exec stuff that happens when you execve
something with the setuid bit set won't prevent it because the setuid
bit is *disabled* in this mode. (Again, there is no exploit here --
there are no new privileges to steal.)

3. Simplicity. It would be easy to miss something subtle. There's
the dumpable bit and however it interacts with ptrace (especially on
Ubuntu, which seems to have a patched kernel with different rules),
there's capset, there's file capabilities (which can, I think, drop
capabilities as well as adding them), there's cap_bset, securebits,
etc. Leaving them all alone means I don't need to worry about bitrot
or about missing one.

I'm tempted to submit a followup patch to allow unprivileged users to
use some of the sys_unshare modes if no_new_privs is set so we can
have an example use case. CLONE_NEWIPC, CLONE_SYSVSEM, and
CLONE_NEWUTS would be a decent start, I think, and the patch would be
trivial.

--Andy

Randy Dunlap

unread,

Jan 14, 2012, 8:00:02 PM1/14/12

to

On 01/12/2012 03:38 PM, Will Drewry wrote:
> Documents how system call filtering using Berkeley Packet
> Filter programs works and how it may be used.
> Includes an example for x86 (32-bit).
>
> v3: - call out BPF <-> Berkeley Packet Filter (rdu...@xenotime.net)
> - document use of tentative always-unprivileged
> - guard sample compilation for i386 and x86_64
> v2: - move code to samples (cor...@lwn.net)

>
> Signed-off-by: Will Drewry <w...@chromium.org>
> ---

> Documentation/prctl/seccomp_filter.txt | 94 ++++++++++++++++++++++++++++++++
> samples/Makefile | 2 +-
> samples/seccomp/Makefile | 18 ++++++
> samples/seccomp/bpf-example.c | 74 +++++++++++++++++++++++++
> 4 files changed, 187 insertions(+), 1 deletions(-)
> create mode 100644 Documentation/prctl/seccomp_filter.txt
> create mode 100644 samples/seccomp/Makefile
> create mode 100644 samples/seccomp/bpf-example.c
>
> diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
> new file mode 100644
> index 0000000..2db8b89
> --- /dev/null
> +++ b/Documentation/prctl/seccomp_filter.txt
> @@ -0,0 +1,94 @@
> + Seccomp filtering
> + =================
> +
> +Introduction
> +------------
> +
> +A large number of system calls are exposed to every userland process
> +with many of them going unused for the entire lifetime of the process.
> +As system calls change and mature, bugs are found and eradicated. A
> +certain subset of userland applications benefit by having a reduced set
> +of available system calls. The resulting set reduces the total kernel
> +surface exposed to the application. System call filtering is meant for
> +use with those applications.
> +
> +Seccomp filtering provides a means for a process to specify a filter
> +for incoming system calls. The filter is expressed as a Berkeley Packet
> +Filter (BPF) program, as with socket filters, except that the data
> +operated on is the current user_regs_struct. This allows for expressive
> +filtering of system calls using the pre-existing system call ABI and
> +using a filter program language with a long history of being exposed to
> +userland. Additionally, BPF makes it impossible for users of seccomp to
> +fall prey to time-of-check-time-of-use (TOCTOU) attacks that are common
> +in system call interposition frameworks because the evaluated data is
> +solely register state just after system call entry.
> +
> +What it isn't
> +-------------
> +
> +System call filtering isn't a sandbox. It provides a clearly defined
> +mechanism for minimizing the exposed kernel surface. Beyond that,
> +policy for logical behavior and information flow should be managed with
> +a combinations of other system hardening techniques and, potentially, a

combination an

> +LSM of your choosing. Expressive, dynamic filters provide further options down
> +this path (avoiding pathological sizes or selecting which of the multiplexed
> +system calls in socketcall() is allowed, for instance) which could be
> +construed, incorrectly, as a more complete sandboxing solution.
> +
> +Usage
> +-----
> +
> +An additional seccomp mode is added, but they are not directly set by the
> +consuming process. The new mode, '2', is only available if
> +CONFIG_SECCOMP_FILTER is set and enabled using prctl with the
> +PR_ATTACH_SECCOMP_FILTER argument.
> +
> +Interacting with seccomp filters is done using one prctl(2) call.
> +
> +PR_ATTACH_SECCOMP_FILTER:
> + Allows the specification of a new filter using a BPF program.
> + The BPF program will be executed over a user_regs_struct data
> + reflecting system call time except with the system call number
> + resident in orig_[register]. To allow a system call, the size
> + of the data must be returned. At present, all other return values
> + result in the system call being blocked, but it is recommended to
> + return 0 in those cases. This will allow for future custom return
> + values to be introduced, if ever desired.
> +
> + Usage:
> + prctl(PR_ATTACH_SECCOMP_FILTER, prog);
> +
> + The 'prog' argument is a pointer to a struct sock_fprog which will
> + contain the filter program. If the program is invalid, the call
> + will return -1 and set errno to -EINVAL.

EINVAL.
(I think)

> +
> + The struct user_regs_struct the @prog will see is based on the
> + personality of the task at the time of this prctl call. Additionally,
> + is_compat_task is also tracked for the @prog. This means that once set
> + the calling task will have all of its system calls blocked if it
> + switches its system call ABI (via personality or other means).
> +
> + If fork/clone and execve are allowed by @prog, any child processes will
> + be constrained to the same filters and syscal call ABI as the parent.

syscall

> +
> + When called from an unprivileged process (lacking CAP_SYS_ADMIN), the
> + "always_unprivileged" bit is enabled for the process.
> +
> + Additionally, if prctl(2) is allowed by the attached filter,
> + additional filters may be layered on which will increase evaluation
> + time, but allow for further decreasing the attack surface during
> + execution of a process.
> +
> +The above call returns 0 on success and non-zero on error.
> +
> +Example
> +-------
> +
> +samples/seccomp-bpf-example.c shows an example process that allows read from stdin,

samples/seccomp/bpf-example.c

> +write to stdout/err, exit and signal returns for 32-bit x86.

/stderr,

> +
> +Adding architecture support
> +-----------------------
> +
> +Any platform with seccomp support will support seccomp filters
> +as long as CONFIG_SECCOMP_FILTER is enabled.

--
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

Indan Zupancic

unread,

Jan 14, 2012, 10:50:01 PM1/14/12

to

Hello,

(I hoped everyone is CC'ed, emails forwarded via lkml truncate the CC list
when it's too long.)

I recommend not giving access to the registers at all, but to instead provide
a fixed cross-platform ABI (a 32 bit version and one 64 bit version).

As everyone dealing with system call is mostly interested in the same things:
The syscall number and the arguments. You can add some other potentially useful
info like instruction pointer as well, but keep it limited to cross-platform
things with a clear meaning that make sense for system call filtering.

So I propose an interface like the following instead of a register interface:

/* Currently 6, but to be future proof, make it 8 */
#define MAX_SC_ARGS 8

struct syscall_bpf_data {
unsigned long syscall_nr;
unsigned long flags;
unsigned long instruction_pointer;
unsigned long arg[MAX_SC_ARGS];
unsigned long _reserved[5];
};

The flag argument can be used to e.g. tell if it is a compat 32 program
running on a 64 bit system.

This way the registers have to be interpreted only once by the kernel and all
filtering programs don't have to do that mapping themselves. It also avoids
doing unnecessary work fiddling/translating registers like the ptrace ABI does.

I missed if the original version was allowed to change the registers or not,
if it is then perhaps the BPF program should set a specific flag after changing
anything, to make it more explicit.

Greetings,

Indan

Casey Schaufler

unread,

Jan 15, 2012, 3:20:01 PM1/15/12

to

I am casting my two cents worth behind Linus. Dropping
privilege can be every bit as dangerous as granting privilege
in the real world of atrocious user land code. Especially in
the case of security policy enforcing user land code.

This even more important in environments that support fine
granularity of privilege, including capabilities and SELinux.
Under SELinux a domain transition can increase, decrease or
completely change a process' access rights and there is really
no way for the kernel to tell which it is because that's all
encoded in the arbitrary SELinux policy. Smack does not try
to maintain a notion of hierarchy of privilege, so the notion
of any change being equivalent to any other is in line with
the Smack philosophy.

Andrew Lutomirski

unread,

Jan 15, 2012, 4:10:02 PM1/15/12

to

On Sun, Jan 15, 2012 at 12:16 PM, Casey Schaufler
<ca...@schaufler-ca.com> wrote:
> On 1/14/2012 12:22 PM, Linus Torvalds wrote:
>>
>> And yes, I really seriously do believe that is both safer and simpler
>> than some model that says "you can drop stuff", and then you have to
>> start making up rules for what "dropping" means.
>>
>> Does "dropping" mean allowing setuid(geteuid()) for example? That *is*
>> dropping the uid in a _POSIX_SAVED_IDS environment. And I'm saying
>> that no, we should not even allow that. It's simply all too "subtle".
>
>
> I am casting my two cents worth behind Linus. Dropping
> privilege can be every bit as dangerous as granting privilege
> in the real world of atrocious user land code. Especially in
> the case of security policy enforcing user land code.

Can you think of *any* plausible attack that is possible with my patch
(i.e. no_new_privs allows setuid, setresuid, and capset) that would be
prevented or even mitigated if I blocked those syscalls? I can't.
(The sendmail-style attack is impossible with no_new_privs.)

Also, how would you even block setuid(2) in a non-confusing manner?
The semantics and error returns are already such a disaster than it's
barely worth it for anything to check the return value.

>
> This even more important in environments that support fine
> granularity of privilege, including capabilities and SELinux.
> Under SELinux a domain transition can increase, decrease or
> completely change a process' access rights and there is really
> no way for the kernel to tell which it is because that's all
> encoded in the arbitrary SELinux policy. Smack does not try
> to maintain a notion of hierarchy of privilege, so the notion
> of any change being equivalent to any other is in line with
> the Smack philosophy.
>

My patch does not (barring bugs) allow selinux domain transitions. I
certainly think that all security transitions that vary across
distributions should be blocked by no_new_privs.

--Andy

Casey Schaufler

unread,

Jan 15, 2012, 4:40:03 PM1/15/12

to

On 1/15/2012 12:59 PM, Andrew Lutomirski wrote:
> On Sun, Jan 15, 2012 at 12:16 PM, Casey Schaufler
> <ca...@schaufler-ca.com> wrote:
>> On 1/14/2012 12:22 PM, Linus Torvalds wrote:
>>> And yes, I really seriously do believe that is both safer and simpler
>>> than some model that says "you can drop stuff", and then you have to
>>> start making up rules for what "dropping" means.
>>>
>>> Does "dropping" mean allowing setuid(geteuid()) for example? That *is*
>>> dropping the uid in a _POSIX_SAVED_IDS environment. And I'm saying
>>> that no, we should not even allow that. It's simply all too "subtle".
>>
>> I am casting my two cents worth behind Linus. Dropping
>> privilege can be every bit as dangerous as granting privilege
>> in the real world of atrocious user land code. Especially in
>> the case of security policy enforcing user land code.
> Can you think of *any* plausible attack that is possible with my patch
> (i.e. no_new_privs allows setuid, setresuid, and capset) that would be
> prevented or even mitigated if I blocked those syscalls? I can't.
> (The sendmail-style attack is impossible with no_new_privs.)

I am notoriously bad at coming up with this sort of example.
I will try, I may not hit the mark, but it should be close.

The application is running with saved uid != euid when
no-new-privs is set. It execs a new binary, which keeps
the saved and effective uids. The program calls setreuid,
which succeeds. It opens the saved userid's files.

Some people reading this will say this is expected and
appropriate behavior, and others will say it's an exploit
under sandbox conditions. Those in the latter group can
point out that blocking setreuid() would prevent what
they consider an exploit. Those in the former group will
say blocking the setreuid would introduce a denial of
service that could lead to an exploit if the program is
poorly written.

Because the correctness of the behavior is ambiguous
(to me at least) when no_new_privs is set I argue that
we shouldn't have it. If we are going to have it my
feeling is that it should be as draconian as possible.
Because you can't always tell if state A is "more"
privileged than state B, any change between A and B has
to be treated as an escalation.

Andrew Lutomirski

unread,

Jan 15, 2012, 5:10:02 PM1/15/12

to

If you don't trust that binary, then why are you execing it with saved
uid != euid in the first place? If you are setting no_new_privs, then
you are new code and should have at least some basic awareness of the
semantics. The exact same "exploit" is possible if you have
CAP_DAC_OVERRIDE with either no_new_privs semantics -- if you have a
privilege and you run untrusted code, then you had better remove that
privilege somehow for the untrusted code.

IOW, *drop privileges if you are a sandbox*. Otherwise you're screwed
with or without no_new_privs.

Another way of saying this is: no_new_privs is not a sandbox. It's
just a way to make it safe for sandboxes and other such weird things
processes can do to themselves safe across execve. If you want a
sandbox, use seccomp mode 2, which will require you to set

Will Drewry

unread,

Jan 15, 2012, 8:50:01 PM1/15/12

to

I don't believe it is possible to create a fixed cross-platform ABI
that will be compat and future safe. user_regs_struct (or even
pt_regs) should always match whatever each arch is doing -- even weird
personality-based changes. If we do a new ABI, not only does that
have to be exported to userland, but it means we're still copying and
munging the data around (which was why I was trying to see if pt_regs
was a easier way to get a speed boost).

If there's consensus, I'll change it (and use syscall_get_arguments),
but I don't believe it makes sense. (more below)

> As everyone dealing with system call is mostly interested in the same things:
> The syscall number and the arguments. You can add some other potentially useful
> info like instruction pointer as well, but keep it limited to cross-platform
> things with a clear meaning that make sense for system call filtering.

Well there are also clone/fork, sig_rt_return, etc related registers
too. I like not making the decision for each syscall filtering
consumer. We have an ABI so it seems like I'd be making work for the
kernel to manage yet another one for system call calling conventions.

> So I propose an interface like the following instead of a register interface:
>
> /* Currently 6, but to be future proof, make it 8 */
> #define MAX_SC_ARGS 8
>
> struct syscall_bpf_data {
> unsigned long syscall_nr;
> unsigned long flags;
> unsigned long instruction_pointer;
> unsigned long arg[MAX_SC_ARGS];
> unsigned long _reserved[5];
> };
>
> The flag argument can be used to e.g. tell if it is a compat 32 program
> running on a 64 bit system.

I certainly considered this, but I don't think this is a practical
idea. Firstly, CONFIG_COMPAT is meant to be compatibility mode. We
can't assume a program knows about it. Second, if we assume any new
program will be "smart" and check @flags, then the first few
instruction of _every_ (32-bit) seccomp filter program will be
checking compat mode - a serious waste :( I'm also not sure if
is_compat_task actually covers all random personality-based changes --
just 32-bit v 64-bit.

I _really_ wanted to make compat a flag and push that logic out of the
kernel, but I don't think it makes sense to burden all ABI consumers
with a "just in case" compat flag check. Also, what happens if a new,
weird architecture comes along where that flag doesn't make the same
sense? We can fix all the internal kernel stuff, but we'd end up with
an ABI change to boot :/ Using regviews, we stay consistent
regardless of whatever the new craziness is. I just wish there was a
way to make it speedier.

> This way the registers have to be interpreted only once by the kernel and all
> filtering programs don't have to do that mapping themselves. It also avoids
> doing unnecessary work fiddling/translating registers like the ptrace ABI does.

The kernel does only interpret them once (after entry to
__secure_computing). It gets the regview and has it populate a
user_regs_struct. All the register info is per-arch and matches
PTRACE_GETREGS, but not PTRACE_PEEKUSR. All the weird stuff is in
PEEKUSR to deal with the fact that compat pt_regs members are not
actually the same width as userspace would expect. If we populated an
ABI as you've proposed, we'd at least need to build that data set and
give it syscall_get_arguments() output.

I was hoping I could just hand over pt_regs and avoid any processing,
but it doesn't look promising. In theory, all the same bit-twiddling
compat_ptrace does could be done during load_pointer in the patch
series, but it seems wrong to go that route.

> I missed if the original version was allowed to change the registers or not,
> if it is then perhaps the BPF program should set a specific flag after changing
> anything, to make it more explicit.

Registers are const from the BPF perspective (just like with socket
filters). Adding support for tracehook interception later could
allow for supervisor guided register mutation.

Thanks!
will

Will Drewry

unread,

Jan 15, 2012, 8:50:01 PM1/15/12

to

Thanks for the close reading! I've got another patchset mostly rolled
and I'll pull these in too.
will

Will Drewry

unread,

Jan 15, 2012, 9:10:02 PM1/15/12

to

One consideration could be to add do_exit()s at known DAC transitions
(set*id, fcaps). I don't know if that'd be wise, but it would remove
some described ambiguity. The same could be done with exec when the
(e)uid/gid/fcaps change. However, none of that helps with the opaque
LSM data, so that'd have to be left up to the LSMs and the LSM_* flag
you've added.

This does seem subject to bitrot though, and your current approach
certainly suits the needs of the general
don't-do-something-stupid-on-exec that seccomp filter, unprivileged
chroot, unshare, etc all need (which is really nice).

> Another way of saying this is: no_new_privs is not a sandbox. It's
> just a way to make it safe for sandboxes and other such weird things
> processes can do to themselves safe across execve. If you want a
> sandbox, use seccomp mode 2, which will require you to set
> no_new_privs.

And even with system call filtering, you'll need to have appropriately
set up accessible files, file descriptors, memory maps, etc if you
want a proper sandbox. There's no single sandbox-equivalent kernel
feature. no_new_privs would be an excellent addition to the sandbox
toolkit. (I've tried to synthesize it in userspace with an empty
capability bounding set, SECURE_NOROOT, and MNT_NOSUID. Having this
in a future-proof task_struct bit form would be amazing.)

cheers!
will

Casey Schaufler

unread,

Jan 15, 2012, 9:50:02 PM1/15/12

to

If I could trust the binary I wouldn't need your no_new_privs
semantics in the first place. Do you have any idea how big the
chrome browser binary is? You can't link it on a 32bit machine
it uses so much address space. On top of that, most modern
applications are compositions of scripts and interpreters built
on top of multiple layers of middleware. Of course I don't trust
the binary!

> If you are setting no_new_privs, then
> you are new code and should have at least some basic awareness of the
> semantics.

It's not the program setting no_new_privs that I'm worried about.
It's the nth descendant of that program and its ancestors that are
going to do screwy things.

> The exact same "exploit" is possible if you have
> CAP_DAC_OVERRIDE with either no_new_privs semantics -- if you have a
> privilege and you run untrusted code, then you had better remove that
> privilege somehow for the untrusted code.

Yes, and my very own name is engraved on the security wall of shame
for the classic sendmail capabilities exploit. Don't think for a
minute that something won't get done just because its obviously
inappropriate.