Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

[RFC,PATCH 0/2] dynamic seccomp policies (using BPF filters)

476 views
Skip to first unread message

Will Drewry

unread,
Jan 11, 2012, 12:30:03 PM1/11/12
to
The goal of the patchset is straightforward:

To provide a means of reducing the kernel attack surface.

In practice, this is done at the primary kernel ABI: system calls.
Achieving this goal will address the needs expressed by many systems
projects:
qemu/kvm, openssh, vsftpd, lxc, and chromium and chromium os (me).

While system call filtering has been attempted many times, I hope that
this approach shows more promise. It works as described below and in
the patch series.

A userland task may call prctl(PR_ATTACH_SECCOMP_FILTER) to attach a
BPF program to itself. Once attached, all system calls made by the
task will be evaluated by the BPF program prior to being accepted.
Evaluation is done by executing the BPF program over the struct
user_regs_state for the process.

!! If you don't care about background or reasoning, stop reading !!

Past attempts have used:
- bitmap of system call numbers evaluated by seccomp (or tracehooks)
- standalone data structures and extra entry hooks
(cgroups syscall, systrace)
- a collection of ftrace filter strings evaluated by seccomp
- perf_event hackery to allow process termination when an event matches
(or doesn't)

In addition to the publicly posted approaches, I've personally attempted
continued deeper integration with ftrace along a number of different
lines (lead up to that can be found here[1]). What inspired the current
patch series was a number of realizations:
1. Userland knows its ABI - that's how it made the system calls in the
first place.
2. We already exposed a filtering system to userland processes in the
form of BPF and there is continued focus on optimizing evaluation
even after so many years.
3. System call filtering policies should not expose
time-of-check-time-of-use (TOCTOU) vulnerable interfaces but should
expose all the information that may be relevant to a syscall policy
decision.

The prior seccomp-ftrace implementations struggled with very
fixable challenges in ftrace: incomplete syscall coverage,
mismatched syscall names versus unistd, incomplete arch coverage,
etc. These challenges may all be fixed with some time and effort, and
potentially, even closer integration. I explored a number of
alternative approaches from making system call tracepoints per-thread
and "active" to adding a new less-perf-oriented system call.

In the process of experimentation, a number of things became clear:
- perf/ftrace system-wide analysis goals don't align with lightweight
per-thread analysis.
- ftrace/perf ABI doesn't mix well with security policy enforcement,
reduced attack surface environments, or keeping users from specifing
vulnerable filtering policies.
- other than system calls, tracepoints aren't considered ABI-stable.

The core focus of ftrace and perf is to support system-wide
performance and debugging tracing. Despite its amazing flexibility,
there are tradeoffs that are made to provide efficient system-wide
behavior that are less efficient at a per-thread level. For instance,
system call tracepoints are global. It is possible to make them
per-thread (since they use a TIF anyway). However, doing so would mean
that a system-wide system call analysis would require one trace event
per thread rather than one total. It's possible to alleviate that pain,
but that in turn requires more bookkeeping (global versus local
tracepoint registrations mapping to the thread info flag).

Another example is the ftrace ABI. Both the debugfs entry point with
unstable event ids and the perf-oriented perf_event_open(2) are not
suitable to providing a subsystem which is meant to reduce the attack
surface -- much less avoid maintainer flame wars :) The third aspect of
its ABI was also concerning and hints at yet-another-potential struggle.
The ftrace filter language happily accepts globbing and string matching.
This is excellent for tracing, but horrible for system call
interposition. If, despite warning, a user decides that blocking a
system call based on a string is what they want, they can do it. The
result is that their policy may be bypassed due to a time of check, time
of use race. While addressable, it would mean that the filtering engine
would need to allow operation filtering or offer a "secure" subset.

A side challenge that emerged from the desire to enable tracing to act
as a security policy mechanism was the ability to enact policy over more
than just the system calls. While this would be doable if all
tracepoints became active, there is a fundamental problem in that very
little, if any, tracepoints aside from system calls can be considered
stable. If a subset were to emerge as stable, there is still the
challenge of enacting security policy in parallel with tracing policy.
In an example patch where security policy logic was added to
perf_event_open(2), the basics of the system worked, but enforcement of
the security policy was simplistic and intertwined with a large number
of event attributes that were meaningless or altered the behavior.

At every turn, it appears that the tracing infrastructure was unsuited
for being used for attack surface reduction or as a larger security
subsystem on its own. It is well suited for feeding a policy
enforcement mechanism (like seccomp), but not for letting the logic
co-exist. It doesn't mean that it has security problems, just that
there will be a continued struggle between having a really good perf
system and and really good kernel attack surface reduction system if
they were merged. While there may be some distant vision where the
apparent struggle does not exist, I don't see how it would be reached.
Of course, anything is possible with unlimited time. :)

That said, much of that discussion is history and to fill in some of the
gaps since I posted the last ftrace-based patches. This patch series
should stand on its own as both straightforward and effective. In my
opinion, this is the direction I should have taken before I sent my
first patch.

I am looking forward to any and all feedback - thanks!
will


[1] http://search.gmane.org/?query=seccomp+wad%40chromium.org&group=gmane.linux.kernel


Will Drewry (3):
seccomp_filters: dynamic system call filtering using BPF programs
Documentation: prctl/seccomp_filter

Documentation/prctl/seccomp_filter.txt | 179 ++++++++
fs/exec.c | 5 +
include/linux/prctl.h | 3 +
include/linux/seccomp.h | 70 +++++-
kernel/Makefile | 1 +
kernel/fork.c | 4 +
kernel/seccomp.c | 8 +
kernel/seccomp_filter.c | 639 +++++++++++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 4 +
security/Kconfig | 12 +
9 files changed, 743 insertions(+), 3 deletions(-)
create mode 100644 kernel/seccomp_filter.c
create mode 100644 Documentation/prctl/seccomp_filter.txt
--
1.7.5.4
























--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Will Drewry

unread,
Jan 11, 2012, 12:30:03 PM1/11/12
to
This patch adds support for seccomp mode 2. This mode enables dynamic
enforcement of system call filtering policy in the kernel as specified
by a userland task. The policy is expressed in terms of a BPF program,
as is used for userland-exposed socket filtering. Instead of network
data, the BPF program is evaluated over struct user_regs_struct at the
time of the system call (as retrieved using regviews).

A filter program may be installed by a userland task by calling
prctl(PR_ATTACH_SECCOMP_FILTER, &fprog);
where fprog is of type struct sock_fprog.

If the first filter program allows subsequent prctl(2) calls, then
additional filter programs may be attached. All attached programs
must be evaluated before a system call will be allowed to proceed.

To avoid CONFIG_COMPAT related landmines, once a filter program is
installed using specific is_compat_task() and current->personality, it
is not allowed to make system calls or attach additional filters which
use a different combination of is_compat_task() and
current->personality.

Filter programs may _only_ cross the execve(2) barrier if last filter
program was attached by a task with CAP_SYS_ADMIN capabilities in its
user namespace. Once a task-local filter program is attached from a
process without privileges, execve will fail. This ensures that only
privileged parent task can affect its privileged children (e.g., setuid
binary).

There are a number of benefits to this approach. A few of which are
as follows:
- BPF has been exposed to userland for a long time.
- Userland already knows its ABI: expected register layout and system
call numbers.
- Full register information is provided which may be relevant for
certain syscalls (fork, rt_sigreturn) or for other userland
filtering tactics (checking the PC).
- No time-of-check-time-of-use vulnerable data accesses are possible.

This patch includes its own BPF evaluator, but relies on the
net/core/filter.c BPF checking code. It is possible to share
evaluators, but the performance sensitive nature of the network
filtering path makes it an iterative optimization which (I think :) can
be tackled separately via separate patchsets. (And at some point sharing
BPF JIT code!)

Signed-off-by: Will Drewry <w...@chromium.org>
---
fs/exec.c | 5 +
include/linux/prctl.h | 3 +
include/linux/seccomp.h | 70 +++++-
kernel/Makefile | 1 +
kernel/fork.c | 4 +
kernel/seccomp.c | 8 +
kernel/seccomp_filter.c | 639 +++++++++++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 4 +
security/Kconfig | 12 +
9 files changed, 743 insertions(+), 3 deletions(-)
create mode 100644 kernel/seccomp_filter.c

diff --git a/fs/exec.c b/fs/exec.c
index 3625464..e9cc89c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -44,6 +44,7 @@
#include <linux/namei.h>
#include <linux/mount.h>
#include <linux/security.h>
+#include <linux/seccomp.h>
#include <linux/syscalls.h>
#include <linux/tsacct_kern.h>
#include <linux/cn_proc.h>
@@ -1477,6 +1478,10 @@ static int do_execve_common(const char *filename,
if (retval)
goto out_ret;

+ retval = seccomp_check_exec();
+ if (retval)
+ goto out_ret;
+
retval = -ENOMEM;
bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
if (!bprm)
diff --git a/include/linux/prctl.h b/include/linux/prctl.h
index a3baeb2..15e2460 100644
--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -64,6 +64,9 @@
#define PR_GET_SECCOMP 21
#define PR_SET_SECCOMP 22

+/* Set process seccomp filters */
+#define PR_ATTACH_SECCOMP_FILTER 36
+
/* Get/set the capability bounding set (as per security/commoncap.c) */
#define PR_CAPBSET_READ 23
#define PR_CAPBSET_DROP 24
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index cc7a4e9..99d163e 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -5,9 +5,28 @@
#ifdef CONFIG_SECCOMP

#include <linux/thread_info.h>
+#include <linux/types.h>
#include <asm/seccomp.h>

-typedef struct { int mode; } seccomp_t;
+struct seccomp_filter;
+/**
+ * struct seccomp_struct - the state of a seccomp'ed process
+ *
+ * @mode:
+ * if this is 0, seccomp is not in use.
+ * is 1, the process is under standard seccomp rules.
+ * is 2, the process is only allowed to make system calls where
+ * associated filters evaluate successfully.
+ * @filter: Metadata for filter if using CONFIG_SECCOMP_FILTER.
+ * @filter must only be accessed from the context of current as there
+ * is no guard.
+ */
+typedef struct seccomp_struct {
+ int mode;
+#ifdef CONFIG_SECCOMP_FILTER
+ struct seccomp_filter *filter;
+#endif
+} seccomp_t;

extern void __secure_computing(int);
static inline void secure_computing(int this_syscall)
@@ -28,8 +47,7 @@ static inline int seccomp_mode(seccomp_t *s)

#include <linux/errno.h>

-typedef struct { } seccomp_t;
-
+typedef struct seccomp_struct { } seccomp_t;
#define secure_computing(x) do { } while (0)

static inline long prctl_get_seccomp(void)
@@ -49,4 +67,50 @@ static inline int seccomp_mode(seccomp_t *s)

#endif /* CONFIG_SECCOMP */

+#ifdef CONFIG_SECCOMP_FILTER
+
+#define seccomp_filter_init_task(_tsk) do { \
+ (_tsk)->seccomp.filter = NULL; \
+} while (0);
+
+/* No locking is needed here because the task_struct will
+ * have no parallel consumers.
+ */
+#define seccomp_filter_free_task(_tsk) do { \
+ put_seccomp_filter((_tsk)->seccomp.filter); \
+} while (0);
+
+extern int seccomp_check_exec(void);
+
+extern long prctl_attach_seccomp_filter(char __user *);
+
+extern struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *);
+extern void put_seccomp_filter(struct seccomp_filter *);
+
+extern int seccomp_test_filters(int);
+extern void seccomp_filter_log_failure(int);
+extern void seccomp_filter_fork(struct task_struct *child,
+ struct task_struct *parent);
+
+#else /* CONFIG_SECCOMP_FILTER */
+
+#include <linux/errno.h>
+
+struct seccomp_filter { };
+#define seccomp_filter_init_task(_tsk) do { } while (0);
+#define seccomp_filter_fork(_tsk, _orig) do { } while (0);
+#define seccomp_filter_free_task(_tsk) do { } while (0);
+
+static inline int seccomp_check_exec(void)
+{
+ return 0;
+}
+
+
+static inline long prctl_attach_seccomp_filter(char __user *a2)
+{
+ return -ENOSYS;
+}
+
+#endif /* CONFIG_SECCOMP_FILTER */
#endif /* _LINUX_SECCOMP_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index e898c5b..0584090 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -79,6 +79,7 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
+obj-$(CONFIG_SECCOMP_FILTER) += seccomp_filter.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
obj-$(CONFIG_TREE_RCU) += rcutree.o
obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
diff --git a/kernel/fork.c b/kernel/fork.c
index da4a6a1..cc1d628 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -34,6 +34,7 @@
#include <linux/cgroup.h>
#include <linux/security.h>
#include <linux/hugetlb.h>
+#include <linux/seccomp.h>
#include <linux/swap.h>
#include <linux/syscalls.h>
#include <linux/jiffies.h>
@@ -166,6 +167,7 @@ void free_task(struct task_struct *tsk)
free_thread_info(tsk->stack);
rt_mutex_debug_task_free(tsk);
ftrace_graph_exit_task(tsk);
+ seccomp_filter_free_task(tsk);
free_task_struct(tsk);
}
EXPORT_SYMBOL(free_task);
@@ -1209,6 +1211,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
/* Perform scheduler related setup. Assign this task to a CPU. */
sched_fork(p);

+ seccomp_filter_init_task(p);
retval = perf_event_init_task(p);
if (retval)
goto bad_fork_cleanup_policy;
@@ -1375,6 +1378,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
if (clone_flags & CLONE_THREAD)
threadgroup_fork_read_unlock(current);
perf_event_fork(p);
+ seccomp_filter_fork(p, current);
return p;

bad_fork_free_pid:
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 57d4b13..78719be 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -47,6 +47,14 @@ void __secure_computing(int this_syscall)
return;
} while (*++syscall);
break;
+#ifdef CONFIG_SECCOMP_FILTER
+ case 2:
+ if (seccomp_test_filters(this_syscall) == 0)
+ return;
+
+ seccomp_filter_log_failure(this_syscall);
+ break;
+#endif
default:
BUG();
}
diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
new file mode 100644
index 0000000..4770847
--- /dev/null
+++ b/kernel/seccomp_filter.c
@@ -0,0 +1,639 @@
+/* bpf program-based system call filtering
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) 2011 The Chromium OS Authors <chromiu...@chromium.org>
+ */
+
+#include <linux/capability.h>
+#include <linux/compat.h>
+#include <linux/err.h>
+#include <linux/errno.h>
+#include <linux/rculist.h>
+#include <linux/filter.h>
+#include <linux/kallsyms.h>
+#include <linux/kref.h>
+#include <linux/module.h>
+#include <linux/pid.h>
+#include <linux/prctl.h>
+#include <linux/ptrace.h>
+#include <linux/ratelimit.h>
+#include <linux/reciprocal_div.h>
+#include <linux/regset.h>
+#include <linux/seccomp.h>
+#include <linux/security.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/user.h>
+
+
+/**
+ * struct seccomp_filter - container for seccomp BPF programs
+ *
+ * @usage: reference count to manage the object lifetime.
+ * get/put helpers should be used when accessing an instance
+ * outside of a lifetime-guarded section. In general, this
+ * is only needed for handling filters shared across tasks.
+ * @creator: pointer to the pid that created this filter
+ * @parent: pointer to the ancestor which this filter will be composed with.
+ * @flags: provide information about filter from creation time.
+ * @personality: personality of the process at filter creation time.
+ * @insns: the BPF program instructions to evaluate
+ * @count: the number of instructions in the program.
+ *
+ * seccomp_filter objects should never be modified after being attached
+ * to a task_struct (other than @usage).
+ */
+struct seccomp_filter {
+ struct kref usage;
+ struct pid *creator;
+ struct seccomp_filter *parent;
+ struct {
+ uint32_t admin:1, /* can allow execve */
+ compat:1, /* CONFIG_COMPAT */
+ __reserved:30;
+ } flags;
+ int personality;
+ unsigned short count; /* Instruction count */
+ struct sock_filter insns[0];
+};
+
+static unsigned int seccomp_run_filter(const u8 *buf,
+ const size_t buflen,
+ const struct sock_filter *);
+
+/**
+ * seccomp_filter_alloc - allocates a new filter object
+ * @padding: size of the insns[0] array in bytes
+ *
+ * The @padding should be a multiple of
+ * sizeof(struct sock_filter).
+ *
+ * Returns ERR_PTR on error or an allocated object.
+ */
+static struct seccomp_filter *seccomp_filter_alloc(unsigned long padding)
+{
+ struct seccomp_filter *f;
+ unsigned long bpf_blocks = padding / sizeof(struct sock_filter);
+
+ /* Drop oversized requests. */
+ if (bpf_blocks == 0 || bpf_blocks > BPF_MAXINSNS)
+ return ERR_PTR(-EINVAL);
+
+ /* Padding should always be in sock_filter increments. */
+ BUG_ON(padding % sizeof(struct sock_filter));
+
+ f = kzalloc(sizeof(struct seccomp_filter) + padding, GFP_KERNEL);
+ if (!f)
+ return ERR_PTR(-ENOMEM);
+ kref_init(&f->usage);
+ f->creator = get_task_pid(current, PIDTYPE_PID);
+ f->count = bpf_blocks;
+ return f;
+}
+
+/**
+ * seccomp_filter_free - frees the allocated filter.
+ * @filter: NULL or live object to be completely destructed.
+ */
+static void seccomp_filter_free(struct seccomp_filter *filter)
+{
+ if (!filter)
+ return;
+ put_seccomp_filter(filter->parent);
+ put_pid(filter->creator);
+ kfree(filter);
+}
+
+static void __put_seccomp_filter(struct kref *kref)
+{
+ struct seccomp_filter *orig =
+ container_of(kref, struct seccomp_filter, usage);
+ seccomp_filter_free(orig);
+}
+
+void seccomp_filter_log_failure(int syscall)
+{
+ pr_info("%s[%d]: system call %d blocked at 0x%lx\n",
+ current->comm, task_pid_nr(current), syscall,
+ KSTK_EIP(current));
+}
+
+/* put_seccomp_filter - decrements the ref count of @orig and may free. */
+void put_seccomp_filter(struct seccomp_filter *orig)
+{
+ if (!orig)
+ return;
+ kref_put(&orig->usage, __put_seccomp_filter);
+}
+
+/* get_seccomp_filter - increments the reference count of @orig. */
+struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *orig)
+{
+ if (!orig)
+ return NULL;
+ kref_get(&orig->usage);
+ return orig;
+}
+
+static int seccomp_check_personality(struct seccomp_filter *filter)
+{
+ if (filter->personality != current->personality)
+ return -EACCES;
+#ifdef CONFIG_COMPAT
+ if (filter->flags.compat != (!!(is_compat_task())))
+ return -EACCES;
+#endif
+ return 0;
+}
+
+static const struct user_regset *
+find_prstatus(const struct user_regset_view *view)
+{
+ const struct user_regset *regset;
+ int n;
+
+ /* Skip 0. */
+ for (n = 1; n < view->n; ++n) {
+ regset = view->regsets + n;
+ if (regset->core_note_type == NT_PRSTATUS)
+ return regset;
+ }
+
+ return NULL;
+}
+
+/**
+ * seccomp_get_regs - returns a pointer to struct user_regs_struct
+ * @scratch: preallocated storage of size @available
+ * @available: pointer to the size of scratch.
+ *
+ * Returns NULL if the registers cannot be acquired or copied.
+ * Returns a populated pointer to @scratch by default.
+ * Otherwise, returns a pointer to a a u8 array containing the struct
+ * user_regs_struct appropriate for the task personality. The pointer
+ * may be to the beginning of @scratch or to an externally managed data
+ * structure. On success, @available should be updated with the
+ * valid region size of the returned pointer.
+ *
+ * If the architecture overrides the linkage, then the pointer may pointer to
+ * another location.
+ */
+__weak u8 *seccomp_get_regs(u8 *scratch, size_t *available)
+{
+ /* regset is usually returned based on task personality, not current
+ * system call convention. This behavior makes it unsafe to execute
+ * BPF programs over regviews if is_compat_task or the personality
+ * have changed since the program was installed.
+ */
+ const struct user_regset_view *view = task_user_regset_view(current);
+ const struct user_regset *regset = &view->regsets[0];
+ size_t scratch_size = *available;
+ if (regset->core_note_type != NT_PRSTATUS) {
+ /* The architecture should override this method for speed. */
+ regset = find_prstatus(view);
+ if (!regset)
+ return NULL;
+ }
+ *available = regset->n * regset->size;
+ /* Make sure the scratch space isn't exceeded. */
+ if (*available > scratch_size)
+ *available = scratch_size;
+ if (regset->get(current, regset, 0, *available, scratch, NULL))
+ return NULL;
+ return scratch;
+}
+
+/**
+ * seccomp_test_filters - tests 'current' against the given syscall
+ * @syscall: number of the system call to test
+ *
+ * Returns 0 on ok and non-zero on error/failure.
+ */
+int seccomp_test_filters(int syscall)
+{
+ struct seccomp_filter *filter;
+ u8 regs_tmp[sizeof(struct user_regs_struct)], *regs;
+ size_t regs_size = sizeof(struct user_regs_struct);
+ int ret = -EACCES;
+
+ filter = current->seccomp.filter; /* uses task ref */
+ if (!filter)
+ goto out;
+
+ /* All filters in the list are required to share the same system call
+ * convention so only the first filter is ever checked.
+ */
+ if (seccomp_check_personality(filter))
+ goto out;
+
+ /* Grab the user_regs_struct. Normally, regs == &regs_tmp, but
+ * that is not mandatory. E.g., it may return a point to
+ * task_pt_regs(current). NULL checking is mandatory.
+ */
+ regs = seccomp_get_regs(regs_tmp, &regs_size);
+ if (!regs)
+ goto out;
+
+ /* Only allow a system call if it is allowed in all ancestors. */
+ ret = 0;
+ for ( ; filter != NULL; filter = filter->parent) {
+ /* Allowed if return value is the size of the data supplied. */
+ if (seccomp_run_filter(regs, regs_size, filter->insns) !=
+ regs_size)
+ ret = -EACCES;
+ }
+out:
+ return ret;
+}
+
+/**
+ * seccomp_attach_filter: Attaches a seccomp filter to current.
+ * @fprog: BPF program to install
+ *
+ * Context: User context only. This function may sleep on allocation and
+ * operates on current. current must be attempting a system call
+ * when this is called (usually prctl).
+ *
+ * This function may be called repeatedly to install additional filters.
+ * Every filter successfully installed will be evaluated (in reverse order)
+ * for each system call the thread makes.
+ *
+ * Returns 0 on success or an errno on failure.
+ */
+long seccomp_attach_filter(struct sock_fprog *fprog)
+{
+ struct seccomp_filter *filter = NULL;
+ /* Note, len is a short so overflow should be impossible. */
+ unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
+ long ret = -EPERM;
+
+ /* Allocate a new seccomp_filter */
+ filter = seccomp_filter_alloc(fp_size);
+ if (IS_ERR(filter)) {
+ ret = PTR_ERR(filter);
+ goto out;
+ }
+
+ /* Lock the process personality and calling convention. */
+#ifdef CONFIG_COMPAT
+ if (is_compat_task())
+ filter->flags.compat = 1;
+#endif
+ filter->personality = current->personality;
+
+ /* Auditing is not needed since the capability wasn't requested */
+ if (security_real_capable_noaudit(current, current_user_ns(),
+ CAP_SYS_ADMIN) == 0)
+ filter->flags.admin = 1;
+
+ /* Copy the instructions from fprog. */
+ ret = -EFAULT;
+ if (copy_from_user(filter->insns, fprog->filter, fp_size))
+ goto out;
+
+ /* Check the fprog */
+ ret = sk_chk_filter(filter->insns, filter->count);
+ if (ret)
+ goto out;
+
+ /* If there is an existing filter, make it the parent
+ * and reuse the existing task-based ref.
+ */
+ filter->parent = current->seccomp.filter;
+
+ /* Force all filters to use one system call convention. */
+ ret = -EINVAL;
+ if (filter->parent) {
+ if (filter->parent->flags.compat != filter->flags.compat)
+ goto out;
+ if (filter->parent->personality != filter->personality)
+ goto out;
+ }
+
+ /* Double claim the new filter so we can release it below simplifying
+ * the error paths earlier.
+ */
+ ret = 0;
+ get_seccomp_filter(filter);
+ current->seccomp.filter = filter;
+ /* Engage seccomp if it wasn't. This doesn't use PR_SET_SECCOMP. */
+ if (!current->seccomp.mode) {
+ current->seccomp.mode = 2;
+ set_thread_flag(TIF_SECCOMP);
+ }
+
+out:
+ put_seccomp_filter(filter); /* for get or task, on err */
+ return ret;
+}
+
+long prctl_attach_seccomp_filter(char __user *user_filter)
+{
+ struct sock_fprog fprog;
+ long ret = -EINVAL;
+
+ ret = -EFAULT;
+ if (!user_filter)
+ goto out;
+
+ if (copy_from_user(&fprog, user_filter, sizeof(fprog)))
+ goto out;
+
+ ret = seccomp_attach_filter(&fprog);
+out:
+ return ret;
+}
+
+/**
+ * seccomp_check_exec: determines if exec is allowed for current
+ * Returns 0 if allowed.
+ */
+int seccomp_check_exec(void)
+{
+ if (current->seccomp.mode != 2)
+ return 0;
+ /* We can rely on the task refcount for the filter. */
+ if (!current->seccomp.filter)
+ return -EPERM;
+ /* The last attached filter set for the process is checked. It must
+ * have been installed with CAP_SYS_ADMIN capabilities.
+ */
+ if (current->seccomp.filter->flags.admin)
+ return 0;
+ return -EPERM;
+}
+
+/* seccomp_filter_fork: manages inheritance on fork
+ * @child: forkee
+ * @parent: forker
+ * Ensures that @child inherit a seccomp_filter iff seccomp is enabled
+ * and the set of filters is marked as 'enabled'.
+ */
+void seccomp_filter_fork(struct task_struct *child,
+ struct task_struct *parent)
+{
+ if (!parent->seccomp.mode)
+ return;
+ child->seccomp.mode = parent->seccomp.mode;
+ child->seccomp.filter = get_seccomp_filter(parent->seccomp.filter);
+}
+
+/* Returns a pointer to the BPF evaluator after checking the offset and size
+ * boundaries. The signature almost matches the signature from
+ * net/core/filter.c with the hopes of sharing code in the future.
+ */
+static const void *load_pointer(const u8 *buf, size_t buflen,
+ int offset, size_t size,
+ void *unused)
+{
+ if (offset >= buflen)
+ goto fail;
+ if (offset < 0)
+ goto fail;
+ if (size > buflen - offset)
+ goto fail;
+ return buf + offset;
+fail:
+ return NULL;
+}
+
+/**
+ * seccomp_run_filter - evaluate BPF (over user_regs_struct)
+ * @buf: buffer to execute the filter over
+ * @buflen: length of the buffer
+ * @fentry: filter to apply
+ *
+ * Decode and apply filter instructions to the buffer.
+ * Return length to keep, 0 for none. @buf is a regset we are
+ * filtering, @filter is the array of filter instructions.
+ * Because all jumps are guaranteed to be before last instruction,
+ * and last instruction guaranteed to be a RET, we dont need to check
+ * flen.
+ *
+ * See core/net/filter.c as this is nearly an exact copy.
+ * At some point, it would be nice to merge them to take advantage of
+ * optimizations (like JIT).
+ *
+ * A successful filter must return the full length of the data. Anything less
+ * will currently result in a seccomp failure. In the future, it may be
+ * possible to use that for hard filtering registers on the fly so it is
+ * ideal for consumers to return 0 on intended failure.
+ */
+static unsigned int seccomp_run_filter(const u8 *buf,
+ const size_t buflen,
+ const struct sock_filter *fentry)
+{
+ const void *ptr;
+ u32 A = 0; /* Accumulator */
+ u32 X = 0; /* Index Register */
+ u32 mem[BPF_MEMWORDS]; /* Scratch Memory Store */
+ u32 tmp;
+ int k;
+
+ /*
+ * Process array of filter instructions.
+ */
+ for (;; fentry++) {
+#if defined(CONFIG_X86_32)
+#define K (fentry->k)
+#else
+ const u32 K = fentry->k;
+#endif
+
+ switch (fentry->code) {
+ case BPF_S_ALU_ADD_X:
+ A += X;
+ continue;
+ case BPF_S_ALU_ADD_K:
+ A += K;
+ continue;
+ case BPF_S_ALU_SUB_X:
+ A -= X;
+ continue;
+ case BPF_S_ALU_SUB_K:
+ A -= K;
+ continue;
+ case BPF_S_ALU_MUL_X:
+ A *= X;
+ continue;
+ case BPF_S_ALU_MUL_K:
+ A *= K;
+ continue;
+ case BPF_S_ALU_DIV_X:
+ if (X == 0)
+ return 0;
+ A /= X;
+ continue;
+ case BPF_S_ALU_DIV_K:
+ A = reciprocal_divide(A, K);
+ continue;
+ case BPF_S_ALU_AND_X:
+ A &= X;
+ continue;
+ case BPF_S_ALU_AND_K:
+ A &= K;
+ continue;
+ case BPF_S_ALU_OR_X:
+ A |= X;
+ continue;
+ case BPF_S_ALU_OR_K:
+ A |= K;
+ continue;
+ case BPF_S_ALU_LSH_X:
+ A <<= X;
+ continue;
+ case BPF_S_ALU_LSH_K:
+ A <<= K;
+ continue;
+ case BPF_S_ALU_RSH_X:
+ A >>= X;
+ continue;
+ case BPF_S_ALU_RSH_K:
+ A >>= K;
+ continue;
+ case BPF_S_ALU_NEG:
+ A = -A;
+ continue;
+ case BPF_S_JMP_JA:
+ fentry += K;
+ continue;
+ case BPF_S_JMP_JGT_K:
+ fentry += (A > K) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JGE_K:
+ fentry += (A >= K) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JEQ_K:
+ fentry += (A == K) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JSET_K:
+ fentry += (A & K) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JGT_X:
+ fentry += (A > X) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JGE_X:
+ fentry += (A >= X) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JEQ_X:
+ fentry += (A == X) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JSET_X:
+ fentry += (A & X) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_LD_W_ABS:
+ k = K;
+load_w:
+ ptr = load_pointer(buf, buflen, k, 4, &tmp);
+ if (ptr != NULL) {
+ /* Note, unlike on network data, values are not
+ * byte swapped.
+ */
+ A = *(const u32 *)ptr;
+ continue;
+ }
+ return 0;
+ case BPF_S_LD_H_ABS:
+ k = K;
+load_h:
+ ptr = load_pointer(buf, buflen, k, 2, &tmp);
+ if (ptr != NULL) {
+ A = *(const u16 *)ptr;
+ continue;
+ }
+ return 0;
+ case BPF_S_LD_B_ABS:
+ k = K;
+load_b:
+ ptr = load_pointer(buf, buflen, k, 1, &tmp);
+ if (ptr != NULL) {
+ A = *(const u8 *)ptr;
+ continue;
+ }
+ return 0;
+ case BPF_S_LD_W_LEN:
+ A = buflen;
+ continue;
+ case BPF_S_LDX_W_LEN:
+ X = buflen;
+ continue;
+ case BPF_S_LD_W_IND:
+ k = X + K;
+ goto load_w;
+ case BPF_S_LD_H_IND:
+ k = X + K;
+ goto load_h;
+ case BPF_S_LD_B_IND:
+ k = X + K;
+ goto load_b;
+ case BPF_S_LDX_B_MSH:
+ ptr = load_pointer(buf, buflen, K, 1, &tmp);
+ if (ptr != NULL) {
+ X = (*(u8 *)ptr & 0xf) << 2;
+ continue;
+ }
+ return 0;
+ case BPF_S_LD_IMM:
+ A = K;
+ continue;
+ case BPF_S_LDX_IMM:
+ X = K;
+ continue;
+ case BPF_S_LD_MEM:
+ A = mem[K];
+ continue;
+ case BPF_S_LDX_MEM:
+ X = mem[K];
+ continue;
+ case BPF_S_MISC_TAX:
+ X = A;
+ continue;
+ case BPF_S_MISC_TXA:
+ A = X;
+ continue;
+ case BPF_S_RET_K:
+ return K;
+ case BPF_S_RET_A:
+ return A;
+ case BPF_S_ST:
+ mem[K] = A;
+ continue;
+ case BPF_S_STX:
+ mem[K] = X;
+ continue;
+ case BPF_S_ANC_PROTOCOL:
+ case BPF_S_ANC_PKTTYPE:
+ case BPF_S_ANC_IFINDEX:
+ case BPF_S_ANC_MARK:
+ case BPF_S_ANC_QUEUE:
+ case BPF_S_ANC_HATYPE:
+ case BPF_S_ANC_RXHASH:
+ case BPF_S_ANC_CPU:
+ case BPF_S_ANC_NLATTR:
+ case BPF_S_ANC_NLATTR_NEST:
+ /* ignored */
+ continue;
+ default:
+ WARN_RATELIMIT(1, "Unknown code:%u jt:%u tf:%u k:%u\n",
+ fentry->code, fentry->jt,
+ fentry->jf, fentry->k);
+ return 0;
+ }
+ }
+
+ return 0;
+}
diff --git a/kernel/sys.c b/kernel/sys.c
index 481611f..77f2eda 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1783,6 +1783,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_SET_SECCOMP:
error = prctl_set_seccomp(arg2);
break;
+ case PR_ATTACH_SECCOMP_FILTER:
+ error = prctl_attach_seccomp_filter((char __user *)
+ arg2);
+ break;
case PR_GET_TSC:
error = GET_TSC_CTL(arg2);
break;
diff --git a/security/Kconfig b/security/Kconfig
index 51bd5a0..77b1106 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -84,6 +84,18 @@ config SECURITY_DMESG_RESTRICT

If you are unsure how to answer this question, answer N.

+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ select SECCOMP
+ depends on EXPERIMENTAL
+ help
+ This kernel feature expands CONFIG_SECCOMP to allow computing
+ in environments with reduced kernel access dictated by a system
+ call filter, expressed in BPF, installed by the application itself
+ through prctl(2).
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config SECURITY
bool "Enable different security models"
depends on SYSFS

Will Drewry

unread,
Jan 11, 2012, 12:30:05 PM1/11/12
to
Document how system call filtering with BPF works
and can be used.

Signed-off-by: Will Drewry <w...@chromium.org>
---
Documentation/prctl/seccomp_filter.txt | 159 ++++++++++++++++++++++++++++++++
1 files changed, 159 insertions(+), 0 deletions(-)
create mode 100644 Documentation/prctl/seccomp_filter.txt

diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
new file mode 100644
index 0000000..5fb3f44
--- /dev/null
+++ b/Documentation/prctl/seccomp_filter.txt
@@ -0,0 +1,159 @@
+ Seccomp filtering
+ =================
+
+Introduction
+------------
+
+A large number of system calls are exposed to every userland process
+with many of them going unused for the entire lifetime of the process.
+As system calls change and mature, bugs are found and eradicated. A
+certain subset of userland applications benefit by having a reduced set
+of available system calls. The resulting set reduces the total kernel
+surface exposed to the application. System call filtering is meant for
+use with those applications.
+
+Seccomp filtering provides a means for a process to specify a filter
+for incoming system calls. The filter is expressed as a Berkeley Packet
+Filter program, as with socket filters, except that the data operated on
+is the current user_regs_struct. This allows for expressive filtering
+of system calls using the pre-existing system call ABI and using a filter
+program language with a long history of being exposed to userland.
+Additionally, BPF makes it impossible for users of seccomp to fall prey to
+time-of-check-time-of-use (TOCTOU) attacks that are common in system call
+interposition frameworks because the evaluated data is solely register state
+just after system call entry.
+
+What it isn't
+-------------
+
+System call filtering isn't a sandbox. It provides a clearly defined
+mechanism for minimizing the exposed kernel surface. Beyond that,
+policy for logical behavior and information flow should be managed with
+a combinations of other system hardening techniques and, potentially, a
+LSM of your choosing. Expressive, dynamic filters provide further options down
+this path (avoiding pathological sizes or selecting which of the multiplexed
+system calls in socketcall() is allowed, for instance) which could be
+construed, incorrectly, as a more complete sandboxing solution.
+
+Usage
+-----
+
+An additional seccomp mode is added, but they are not directly set by the
+consuming process. The new mode, '2', is only available if
+CONFIG_SECCOMP_FILTER is set and enabled using prctl with the
+PR_ATTACH_SECCOMP_FILTER argument.
+
+Interacting with seccomp filters is done using one prctl(2) call.
+
+PR_ATTACH_SECCOMP_FILTER:
+ Allows the specification of a new filter using a BPF program.
+ The BPF program will be executed over a user_regs_struct data
+ reflecting system call time except with the system call number
+ resident in orig_[register]. To allow a system call, the size
+ of the data must be returned. At present, all other return values
+ result in the system call being blocked, but it is recommended to
+ return 0 in those cases. This will allow for future custom return
+ values to be introduced, if ever desired.
+
+ Usage:
+ prctl(PR_ATTACH_SECCOMP_FILTER, prog);
+
+ The 'prog' argument is a pointer to a struct sock_fprog which will
+ contain the filter program. If the program is invalid, the call
+ will return -1 and set errno to -EINVAL.
+
+ The struct user_regs_struct the @prog will see is based on the
+ personality of the task at the time of this prctl call. Additionally,
+ is_compat_task is also tracked for the @prog. This means that once set
+ the calling task will have all of its system calls blocked if it
+ switches its system call ABI (via personality or other means).
+
+ If the @prog is installed while the task has CAP_SYS_ADMIN in its user
+ namespace, the @prog will be marked as inheritable across execve. Any
+ inherited filters are still subject to the system call ABI constraints
+ above and any ABI mismatched system calls will result in process death.
+
+All of the above calls return 0 on success and non-zero on error.
+
+
+Example
+-------
+
+Assume a process would like to cleanly read and write to stdin/out/err and exit
+cleanly. Without using a BPF compiler, it may be done as follows on x86 32-bit:
+
+#include <asm/unistd.h>
+#include <linux/filter.h>
+#include <stdio.h>
+#include <stddef.h>
+#include <sys/user.h>
+#include <unistd.h>
+
+#define regoffset(_reg) (offsetof(struct user_regs_struct, _reg))
+int install_filter(void)
+{
+ struct sock_filter filter[] = {
+ /* Grab the system call number */
+ BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(orig_eax)),
+ /* Jump table for the allowed syscalls */
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_rt_sigreturn, 10, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_sigreturn, 9, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit_group, 8, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit, 7, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_read, 1, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_write, 2, 6),
+
+ /* Check that read is only using stdin. */
+ BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(ebx)),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDIN_FILENO, 3, 4),
+
+ /* Check that write is only using stdout/stderr */
+ BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(ebx)),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDOUT_FILENO, 1, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDERR_FILENO, 0, 1),
+
+ /* Put the "accept" value in A */
+ BPF_STMT(BPF_LD+BPF_W+BPF_LEN, 0),
+
+ BPF_STMT(BPF_RET+BPF_A,0),
+ };
+ struct sock_fprog prog = {
+ .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+ .filter = filter,
+ };
+ if (prctl(36, &prog)) {
+ perror("prctl");
+ return 1;
+ }
+ return 0;
+}
+
+#define payload(_c) _c, sizeof(_c)
+int main(int argc, char **argv) {
+ char buf[4096];
+ ssize_t bytes = 0;
+ if (install_filter())
+ return 1;
+ syscall(__NR_write, STDOUT_FILENO, payload("OHAI! WHAT IS YOUR NAME? "));
+ bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf));
+ syscall(__NR_write, STDOUT_FILENO, payload("HELLO, "));
+ syscall(__NR_write, STDOUT_FILENO, buf, bytes);
+ return 0;
+}
+
+Additionally, if prctl(2) is allowed by the installed filter, additional
+filters may be layered on which will increase evaluation time, but allow for
+further decreasing the attack surface during execution of a process.
+
+
+Caveats
+-------
+
+- execve will fail unless the most recently attached filter was installed by
+ a process with CAP_SYS_ADMIN (in its namespace).
+
+Adding architecture support
+-----------------------
+
+Any platform with seccomp support will support seccomp filters
+as long as CONFIG_SECCOMP_FILTER is enabled.

Jonathan Corbet

unread,
Jan 11, 2012, 3:10:01 PM1/11/12
to
Interesting approach to the problem, I think I like it. Watch for news at
11...:)

One nit:

> +Example
> +-------
> +
> +Assume a process would like to cleanly read and write to stdin/out/err and exit
> +cleanly. Without using a BPF compiler, it may be done as follows on x86 32-bit:
> +

It seems like this little program belongs in the samples/ directory.

Thanks,

jon

Will Drewry

unread,
Jan 11, 2012, 3:20:02 PM1/11/12
to
On Wed, Jan 11, 2012 at 2:03 PM, Jonathan Corbet <cor...@lwn.net> wrote:
> Interesting approach to the problem, I think I like it.  Watch for news at
> 11...:)

Thanks - I'm glad to hear it!

> One nit:
>
>> +Example
>> +-------
>> +
>> +Assume a process would like to cleanly read and write to stdin/out/err and exit
>> +cleanly.  Without using a BPF compiler, it may be done as follows on x86 32-bit:
>> +
>
> It seems like this little program belongs in the samples/ directory.

Cool - I'll do that and rev this patch.

cheers!
will

Will Drewry

unread,
Jan 11, 2012, 6:30:02 PM1/11/12
to
Document how system call filtering with BPF works and
may be used. Includes an example for x86 (32-bit).

Signed-off-by: Will Drewry <w...@chromium.org>
---
Documentation/prctl/seccomp_filter.txt | 99 ++++++++++++++++++++++++++++++++
samples/Makefile | 2 +-
samples/seccomp/Makefile | 12 ++++
samples/seccomp/bpf-example.c | 74 ++++++++++++++++++++++++
4 files changed, 186 insertions(+), 1 deletions(-)
create mode 100644 Documentation/prctl/seccomp_filter.txt
create mode 100644 samples/seccomp/Makefile
create mode 100644 samples/seccomp/bpf-example.c

diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
new file mode 100644
index 0000000..15d4645
--- /dev/null
+++ b/Documentation/prctl/seccomp_filter.txt
@@ -0,0 +1,99 @@
+ Additionally, if prctl(2) is allowed by the attached filter,
+ additional filters may be layered on which will increase evaluation
+ time, but allow for further decreasing the attack surface during
+ execution of a process.
+
+The above call returns 0 on success and non-zero on error.
+
+Example
+-------
+
+samples/seccomp-bpf-example.c shows an example process that allows read from stdin,
+write to stdout/err, exit and signal returns for 32-bit x86.
+
+Caveats
+-------
+
+- execve will fail unless the most recently attached filter was installed by
+ a process with CAP_SYS_ADMIN (in its namespace).
+
+Adding architecture support
+-----------------------
+
+Any platform with seccomp support will support seccomp filters
+as long as CONFIG_SECCOMP_FILTER is enabled.
diff --git a/samples/Makefile b/samples/Makefile
index 6280817..f29b19c 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -1,4 +1,4 @@
# Makefile for Linux samples code

obj-$(CONFIG_SAMPLES) += kobject/ kprobes/ tracepoints/ trace_events/ \
- hw_breakpoint/ kfifo/ kdb/ hidraw/
+ hw_breakpoint/ kfifo/ kdb/ hidraw/ seccomp/
diff --git a/samples/seccomp/Makefile b/samples/seccomp/Makefile
new file mode 100644
index 0000000..80dc8e4
--- /dev/null
+++ b/samples/seccomp/Makefile
@@ -0,0 +1,12 @@
+# kbuild trick to avoid linker error. Can be omitted if a module is built.
+obj- := dummy.o
+
+# List of programs to build
+hostprogs-$(CONFIG_X86_32) := bpf-example
+bpf-example-objs := bpf-example.o
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
+
+HOSTCFLAGS_bpf-example.o += -I$(objtree)/usr/include -m32
+HOSTLOADLIBES_bpf-example += -m32
diff --git a/samples/seccomp/bpf-example.c b/samples/seccomp/bpf-example.c
new file mode 100644
index 0000000..f98b70a
--- /dev/null
+++ b/samples/seccomp/bpf-example.c
@@ -0,0 +1,74 @@
+/*
+ * Seccomp BPF example
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <chromiu...@chromium.org>
+ * Author: Will Drewry <w...@chromium.org>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
+ */
+
+#include <asm/unistd.h>
+#include <linux/filter.h>
+#include <stdio.h>
+#include <stddef.h>
+#include <sys/prctl.h>
+#include <sys/user.h>
+#include <unistd.h>
+
+#ifndef PR_ATTACH_SECCOMP_FILTER
+# define PR_ATTACH_SECCOMP_FILTER 36
+#endif
+
+#define regoffset(_reg) (offsetof(struct user_regs_struct, _reg))
+static int install_filter(void)
+ if (prctl(PR_ATTACH_SECCOMP_FILTER, &prog)) {
+ perror("prctl");
+ return 1;
+ }
+ return 0;
+}
+
+#define payload(_c) _c, sizeof(_c)
+int main(int argc, char **argv) {
+ char buf[4096];
+ ssize_t bytes = 0;
+ if (install_filter())
+ return 1;
+ syscall(__NR_write, STDOUT_FILENO, payload("OHAI! WHAT IS YOUR NAME? "));
+ bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf));
+ syscall(__NR_write, STDOUT_FILENO, payload("HELLO, "));
+ syscall(__NR_write, STDOUT_FILENO, buf, bytes);
+ return 0;
+}
--
1.7.5.4

Will Drewry

unread,
Jan 11, 2012, 7:30:02 PM1/11/12
to
Hrm, I may need to guard sample compilation based on host arch and not
just target arch. Documentation v3 will be on the way once I have that
behaving properly. :/

Sorry!
will

Serge Hallyn

unread,
Jan 12, 2012, 4:00:02 AM1/12/12
to
Hey Will,

A few comments below, but otherwise

Acked-by: Serge Hallyn <serge....@canonical.com>

thanks,
-serge
I still think the BUG_ON here is harsh given that the progsize is passed
in by userspace. Was there a reason not to return -EINVAL here?
This comment is confusing. By 'It must' you mean that if not, it's
denied. But if I didn't know better I would read that as "we can't
get to this code unless". Can you change it to something like
"Exec is refused unless the filter was installed with CAP_SYS_ADMIN
privilege"?

Łukasz Sowa

unread,
Jan 12, 2012, 8:20:01 AM1/12/12
to
Hi Will,

That's very different approach to the system call interposition problem.
I find you solution very interesting. It gives far more capabilities
than my syscalls cgroup that you commented on some time ago. It's ready
now but I haven't tried filtering yet. I think that if your solution
make it to the mainline (and I guess that's really possible at current
stage :)), there will be no place for mine solution but that's ok.

There's one thing that I'm curious about - have you measured overhead in
any way? That was one of the biggest issues in all previous attempts to
limit syscalls. I'd love to compare the numbers with mine solution.

I'll examine your patch later on and put some comments if I bump into
something.

Best Regards,
Lukasz Sowa

Oleg Nesterov

unread,
Jan 12, 2012, 10:00:02 AM1/12/12
to
On 01/11, Will Drewry wrote:
>
> This patch adds support for seccomp mode 2. This mode enables dynamic
> enforcement of system call filtering policy in the kernel as specified
> by a userland task. The policy is expressed in terms of a BPF program,
> as is used for userland-exposed socket filtering. Instead of network
> data, the BPF program is evaluated over struct user_regs_struct at the
> time of the system call (as retrieved using regviews).

Cool ;)

I didn't really read this patch yet, just one nit.

> +#define seccomp_filter_init_task(_tsk) do { \
> + (_tsk)->seccomp.filter = NULL; \
> +} while (0);

Cosmetic and subjective, but imho it would be better to add inline
functions instead of define's.

> @@ -166,6 +167,7 @@ void free_task(struct task_struct *tsk)
> free_thread_info(tsk->stack);
> rt_mutex_debug_task_free(tsk);
> ftrace_graph_exit_task(tsk);
> + seccomp_filter_free_task(tsk);
> free_task_struct(tsk);
> }
> EXPORT_SYMBOL(free_task);
> @@ -1209,6 +1211,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> /* Perform scheduler related setup. Assign this task to a CPU. */
> sched_fork(p);
>
> + seccomp_filter_init_task(p);

This doesn't look right or I missed something. something seccomp_filter_init_task()
should be called right after dup_task_struct(), at least before copy process can
fail.

Otherwise copy_process()->free_fork()->seccomp_filter_free_task() can put
current->seccomp.filter copied by arch_dup_task_struct().

> +struct seccomp_filter {
> + struct kref usage;
> + struct pid *creator;

Why? seccomp_filter->creator is never used, no?

Oleg.

Steven Rostedt

unread,
Jan 12, 2012, 10:50:02 AM1/12/12
to
On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:

> Filter programs may _only_ cross the execve(2) barrier if last filter
> program was attached by a task with CAP_SYS_ADMIN capabilities in its
> user namespace. Once a task-local filter program is attached from a
> process without privileges, execve will fail. This ensures that only
> privileged parent task can affect its privileged children (e.g., setuid
> binary).

This means that a non privileged user can not run another program with
limited features? How would a process exec another program and filter
it? I would assume that the filter would need to be attached first and
then the execv() would be performed. But after the filter is attached,
the execv is prevented?

Maybe I don't understand this correctly.

-- Steve

Andrew Lutomirski

unread,
Jan 12, 2012, 11:20:02 AM1/12/12
to
On Thu, Jan 12, 2012 at 7:43 AM, Steven Rostedt <ros...@goodmis.org> wrote:
> On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
>
>> Filter programs may _only_ cross the execve(2) barrier if last filter
>> program was attached by a task with CAP_SYS_ADMIN capabilities in its
>> user namespace.  Once a task-local filter program is attached from a
>> process without privileges, execve will fail.  This ensures that only
>> privileged parent task can affect its privileged children (e.g., setuid
>> binary).
>
> This means that a non privileged user can not run another program with
> limited features? How would a process exec another program and filter
> it? I would assume that the filter would need to be attached first and
> then the execv() would be performed. But after the filter is attached,
> the execv is prevented?
>
> Maybe I don't understand this correctly.

Time to resurrect execve_nosecurity? If so, then the rule could be
simplified to: seccomp programs cannot use normal execve at all.

The longer I linger on lists and see neat ideas like this, the more I
get annoyed that execve is magical. I dream of a distribution that
doesn't use setuid, file capabilities, selinux transitions on exec, or
any other privilege changes on exec *at all*. I think that the only
things missing in the kernel (other than something intelligent to do
about SELinux) are execve_nosecurity and the ability for a normal
program to wait for an unrelated program to finish (or some other way
that a program can ask a daemon to spawn a privileged program for it
and then to cleanly wait for that program to finish in a way that
could survive re-exec of the daemon).

--Andy

Oleg Nesterov

unread,
Jan 12, 2012, 11:30:02 AM1/12/12
to
On 01/11, Will Drewry wrote:
>
Stupid question. I am sure you know what are you doing ;) and I know
nothing about !x86 arches.

But could you explain why it is designed to use user_regs_struct ?
Why we can't simply use task_pt_regs() and avoid the (costly) regsets?

Just curious.

Oleg.

Oleg Nesterov

unread,
Jan 12, 2012, 11:30:02 AM1/12/12
to
On 01/12, Steven Rostedt wrote:
>
> On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
>
> > Filter programs may _only_ cross the execve(2) barrier if last filter
> > program was attached by a task with CAP_SYS_ADMIN capabilities in its
> > user namespace. Once a task-local filter program is attached from a
> > process without privileges, execve will fail. This ensures that only
> > privileged parent task can affect its privileged children (e.g., setuid
> > binary).
>
> This means that a non privileged user can not run another program with
> limited features? How would a process exec another program and filter
> it? I would assume that the filter would need to be attached first and
> then the execv() would be performed. But after the filter is attached,
> the execv is prevented?
>
> Maybe I don't understand this correctly.

May be this needs something like LSM_UNSAFE_SECCOMP, or perhaps
cap_bprm_set_creds() should take seccomp.mode == 2 into account, I dunno.

OTOH, currently seccomp.mode == 1 doesn't allow to exec at all.

Oleg.

Alan Cox

unread,
Jan 12, 2012, 11:30:02 AM1/12/12
to
> Filter programs may _only_ cross the execve(2) barrier if last filter
> program was attached by a task with CAP_SYS_ADMIN capabilities in its
> user namespace. Once a task-local filter program is attached from a
> process without privileges, execve will fail. This ensures that only
> privileged parent task can affect its privileged children (e.g., setuid
> binary).

I think this model is wrong. The rest of the policy rules all work on the
basis that dumpable is the decider (the same rules for not dumping, not
tracing, etc). A user should be able to apply filter to their own code
arbitarily. Any setuid app should IMHO lose the trace subject to the usual
uid rules and capability rules. That would seem to be more flexible and
also the path of least surprise.

[plus you can implement non setuid exec entirely in userspace so it's
a rather meaningless distinction you propose]

> be tackled separately via separate patchsets. (And at some point sharing
> BPF JIT code!)

A BPF jit ought to be trivial and would be a big win.

In general I like this approach. It's simple, it's compact and it offers
interesting possibilities for solving some interesting problem spaces,
without the full weight of SELinux, SMACK etc which are still needed for
heavyweight security.

Alan

Steven Rostedt

unread,
Jan 12, 2012, 11:30:02 AM1/12/12
to
On Thu, 2012-01-12 at 08:14 -0800, Andrew Lutomirski wrote:

> The longer I linger on lists and see neat ideas like this, the more I
> get annoyed that execve is magical. I dream of a distribution that
> doesn't use setuid, file capabilities, selinux transitions on exec, or
> any other privilege changes on exec *at all*.

Is that the fear with filtering on execv? That if we have filters on an
execv calling a setuid program that we change the behavior of that
privileged program and might cause unexpected results?

In that case, just have execv fail if filtering is enabled and we are
execing a setuid program. But I don't see why non "magical" execv's
should be prohibited.

Steven Rostedt

unread,
Jan 12, 2012, 11:40:02 AM1/12/12
to
On Thu, 2012-01-12 at 17:14 +0100, Oleg Nesterov wrote:

> May be this needs something like LSM_UNSAFE_SECCOMP, or perhaps
> cap_bprm_set_creds() should take seccomp.mode == 2 into account, I dunno.
>
> OTOH, currently seccomp.mode == 1 doesn't allow to exec at all.

I've never used seccomp, so I admit I'm totally ignorant on this topic.

But looking at seccomp from the outside, the biggest advantage to this
would be the ability for normal processes to be able to limit tasks it
kicks off. If I want to run a task in a sandbox, I don't want to be root
to do so.

I guess a web browser doesn't perform an exec to run java programs. But
it would be nice if I could execute something from the command line that
I could run in a sand box.

What's the problem with making sure that the setuid isn't set before
doing an execv? Only fail when setuid (or some other magic) is enabled
on the file being exec'd.

Or is this a race where I can have a soft link pointing to a normal
file, run this, and have the link change to a setuid file at just the
right time that causes it to happen?


-- Steve

Oleg Nesterov

unread,
Jan 12, 2012, 12:00:02 PM1/12/12
to
On 01/12, Steven Rostedt wrote:
>
> On Thu, 2012-01-12 at 17:14 +0100, Oleg Nesterov wrote:
>
> > May be this needs something like LSM_UNSAFE_SECCOMP, or perhaps
> > cap_bprm_set_creds() should take seccomp.mode == 2 into account, I dunno.
> >
> > OTOH, currently seccomp.mode == 1 doesn't allow to exec at all.
>
> I've never used seccomp, so I admit I'm totally ignorant on this topic.

me too ;)

> But looking at seccomp from the outside, the biggest advantage to this
> would be the ability for normal processes to be able to limit tasks it
> kicks off. If I want to run a task in a sandbox, I don't want to be root
> to do so.
>
> I guess a web browser doesn't perform an exec to run java programs. But
> it would be nice if I could execute something from the command line that
> I could run in a sand box.
>
> What's the problem with making sure that the setuid isn't set before
> doing an execv? Only fail when setuid (or some other magic) is enabled
> on the file being exec'd.

I agree. That is why I mentioned LSM_UNSAFE_SECCOMP/cap_bprm_set_creds.
Just I do not know what would be the most simple/clean way to do this.


And in any case I agree that the current seccomp_check_exec() looks
strange. Btw, it does
{
if (current->seccomp.mode != 2)
return 0;
/* We can rely on the task refcount for the filter. */
if (!current->seccomp.filter)
return -EPERM;

How it is possible to have seccomp.filter == NULL with mode == 2?

Oleg.

Will Drewry

unread,
Jan 12, 2012, 12:00:02 PM1/12/12
to
On Thu, Jan 12, 2012 at 9:43 AM, Steven Rostedt <ros...@goodmis.org> wrote:
> On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
>
>> Filter programs may _only_ cross the execve(2) barrier if last filter
>> program was attached by a task with CAP_SYS_ADMIN capabilities in its
>> user namespace.  Once a task-local filter program is attached from a
>> process without privileges, execve will fail.  This ensures that only
>> privileged parent task can affect its privileged children (e.g., setuid
>> binary).
>
> This means that a non privileged user can not run another program with
> limited features? How would a process exec another program and filter
> it? I would assume that the filter would need to be attached first and
> then the execv() would be performed. But after the filter is attached,
> the execv is prevented?

Yeah - it means tasks can filter themselves, but not each other.
However, you can inject a filter for any dynamically linked executable
using LD_PRELOAD.

> Maybe I don't understand this correctly.

You're right on. This was to ensure that one process didn't cause
crazy behavior in another. I think Alan has a better proposal than
mine below. (Goes back to catching up.)

Will Drewry

unread,
Jan 12, 2012, 12:00:01 PM1/12/12
to
Thanks! Unimportant responses below. Fixes will be incorporated in
the next round (along with Oleg's feedback).

cheers,
will
I've changed it in the next revision. As is, I don't believe
userspace can control
the size of padding directly, just the increment since it specifies
its length in terms
of bpf blocks (sizeof(struct sock_filter)). But EINVAL is certainly
less aggressive :)
Sounds good!

Will Drewry

unread,
Jan 12, 2012, 12:00:02 PM1/12/12
to
On Thu, Jan 12, 2012 at 8:50 AM, Oleg Nesterov <ol...@redhat.com> wrote:
> On 01/11, Will Drewry wrote:
>>
>> This patch adds support for seccomp mode 2.  This mode enables dynamic
>> enforcement of system call filtering policy in the kernel as specified
>> by a userland task.  The policy is expressed in terms of a BPF program,
>> as is used for userland-exposed socket filtering.  Instead of network
>> data, the BPF program is evaluated over struct user_regs_struct at the
>> time of the system call (as retrieved using regviews).
>
> Cool ;)
>
> I didn't really read this patch yet, just one nit.
>
>> +#define seccomp_filter_init_task(_tsk) do { \
>> +     (_tsk)->seccomp.filter = NULL; \
>> +} while (0);
>
> Cosmetic and subjective, but imho it would be better to add inline
> functions instead of define's.

Refactoring it a bit to make that possible. Since seccomp fork/init/free
never needs access to the whole task_structs, I'll just pass in what's
needed (and avoid the sched.h inclusion recursion).

Comments on the next round will most definitely be appreciated!

>> @@ -166,6 +167,7 @@ void free_task(struct task_struct *tsk)
>>       free_thread_info(tsk->stack);
>>       rt_mutex_debug_task_free(tsk);
>>       ftrace_graph_exit_task(tsk);
>> +     seccomp_filter_free_task(tsk);
>>       free_task_struct(tsk);
>>  }
>>  EXPORT_SYMBOL(free_task);
>> @@ -1209,6 +1211,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>>       /* Perform scheduler related setup. Assign this task to a CPU. */
>>       sched_fork(p);
>>
>> +     seccomp_filter_init_task(p);
>
> This doesn't look right or I missed something. something seccomp_filter_init_task()
> should be called right after dup_task_struct(), at least before copy process can
> fail.
>
> Otherwise copy_process()->free_fork()->seccomp_filter_free_task() can put
> current->seccomp.filter copied by arch_dup_task_struct().

Ah - makes sense! I moved it under dup_task_struct before any goto's
to bad_fork_free.

>> +struct seccomp_filter {
>> +     struct kref usage;
>> +     struct pid *creator;
>
> Why? seccomp_filter->creator is never used, no?

Removing it. It is from a related patch I'm experimenting with (adding
optional tracehook support), but it has no bearing here.

Thanks - new patch revision incoming!
will

Andrew Lutomirski

unread,
Jan 12, 2012, 12:00:03 PM1/12/12
to
On Thu, Jan 12, 2012 at 8:27 AM, Steven Rostedt <ros...@goodmis.org> wrote:
> On Thu, 2012-01-12 at 08:14 -0800, Andrew Lutomirski wrote:
>
>> The longer I linger on lists and see neat ideas like this, the more I
>> get annoyed that execve is magical.  I dream of a distribution that
>> doesn't use setuid, file capabilities, selinux transitions on exec, or
>> any other privilege changes on exec *at all*.
>
> Is that the fear with filtering on execv? That if we have filters on an
> execv calling a setuid program that we change the behavior of that
> privileged program and might cause unexpected results?

Exactly.

>
> In that case, just have execv fail if filtering is enabled and we are
> execing a setuid program. But I don't see why non "magical" execv's
> should be prohibited.
>

How do you define "non-magical"?

If setuid is set, then it's obviously magical. On a nosuid
filesystem, strange things happen. If file capabilities are enabled
and set, then different magic happens. With LSMs involved, anything
can be magical. (SELinux AFAICT looks up rules on every single exec,
so it might be impossible for execve to be non-magical.) If execve is
banned entirely when seccomp is enabled, then there will never be any
attacks based on abusing these mechanisms.

My proposal is to have an alternative mechanism that, from a security
POV, does nothing that the caller couldn't have done on its own. The
only reason it would be needed at all is because implementing execve
with correct semantics from userspace is a PITA -- the right set of
fds needs to be closed, threads need to be killed (without races),
vmas need to be found an unmapped, a new program needs to be mapped in
(possibly at the same place that the old one was mapped at),
/proc/self/exe needs to be updated, auxv needs to be recreated
(including using values that glibc might have erased already), etc.

The code is short and it works (although I have no idea whether it
applies to current kernels).

Oleg: my only issue with setting something like LSM_UNSAFE_SECCOMP is
that a different class of vulnerability might be introduced: take a
setuid program that calls other setuid programs (or just uses execve
as a way to get the default execve capability handling, SELinux
handling, etc), run it (as root!) inside seccomp, and watch it
possibly develop security holes. If the alternate execve is a
different syscall, then this can't happen. And if someone remaps
execve to execve_nosecurity (from userspace or via some in-kernel
mechanism) and causes problems, it's entirely clear who to blame.

--Andy

Will Drewry

unread,
Jan 12, 2012, 12:10:02 PM1/12/12
to
On Thu, Jan 12, 2012 at 10:18 AM, Alan Cox <al...@lxorguk.ukuu.org.uk> wrote:
>> Filter programs may _only_ cross the execve(2) barrier if last filter
>> program was attached by a task with CAP_SYS_ADMIN capabilities in its
>> user namespace.  Once a task-local filter program is attached from a
>> process without privileges, execve will fail.  This ensures that only
>> privileged parent task can affect its privileged children (e.g., setuid
>> binary).
>
> I think this model is wrong. The rest of the policy rules all work on the
> basis that dumpable is the decider (the same rules for not dumping, not
> tracing, etc). A user should be able to apply filter to their own code
> arbitarily. Any setuid app should IMHO lose the trace subject to the usual
> uid rules and capability rules. That would seem to be more flexible and
> also the path of least surprise.

My line of thinking up to now has been that disallowing setuid exec
would mean there is no risk of an errant setuid binary allowing escape
from the system call filters (which the containers people may care
more about). Since setuid is privilege escalation, then perhaps it
makes sense to allow it as an escape hatch.

Would it be sane to just disallow setuid exec exclusively?

> [plus you can implement non setuid exec entirely in userspace so it's
> a rather meaningless distinction you propose]

Agreed.

>> be tackled separately via separate patchsets. (And at some point sharing
>> BPF JIT code!)
>
> A BPF jit ought to be trivial and would be a big win.
>
> In general I like this approach. It's simple, it's compact and it offers
> interesting possibilities for solving some interesting problem spaces,
> without the full weight of SELinux, SMACK etc which are still needed for
> heavyweight security.
>

Thanks! Yeah I think merging with the network stack is eminently
doable, but I didn't want to bog down the proposal in how much
overhead I might be adding to the network layer.

Will Drewry

unread,
Jan 12, 2012, 12:10:02 PM1/12/12
to
It shouldn't be. It's another relic I missed from development. (Adding to v3 :)

Andrew Lutomirski

unread,
Jan 12, 2012, 12:10:03 PM1/12/12
to
On Wed, Jan 11, 2012 at 9:25 AM, Will Drewry <w...@chromium.org> wrote:
> This patch adds support for seccomp mode 2.  This mode enables dynamic
> enforcement of system call filtering policy in the kernel as specified
> by a userland task.  The policy is expressed in terms of a BPF program,
> as is used for userland-exposed socket filtering.  Instead of network
> data, the BPF program is evaluated over struct user_regs_struct at the
> time of the system call (as retrieved using regviews).
>

There's some seccomp-related code in the vsyscall emulation path in
arch/x86/kernel/vsyscall_64.c. How should time(), getcpu(), and
gettimeofday() be handled? If you want filtering to work, there
aren't any real syscall registers to inspect, but they could be
synthesized.

Preventing a malicious task from figuring out approximately what time
it is is basically impossible because of the way that vvars work. I
don't know how to change that efficiently.

--Andy

Alan Cox

unread,
Jan 12, 2012, 12:20:02 PM1/12/12
to
> more about). Since setuid is privilege escalation, then perhaps it
> makes sense to allow it as an escape hatch.
>
> Would it be sane to just disallow setuid exec exclusively?

I think that is a policy question. I can imagine cases where either
behaviour is the "right" one so it may need to be a parameter ?

Alan

Will Drewry

unread,
Jan 12, 2012, 12:20:02 PM1/12/12
to
So on x86 32, it would work since user_regs_struct == task_pt_regs
(iirc), but on x86-64
and others, that's not true. I don't think it's kosher to expose
pt_regs to the userspace, but if, let's say, x86-32 overrides the weak
linkage, then it could just return task_pt_regs and be the fastest
path.

If it would be appropriate to expose pt_regs to userspace, then I'd
happily do so :)

Steven Rostedt

unread,
Jan 12, 2012, 12:20:02 PM1/12/12
to
On Thu, 2012-01-12 at 09:09 -0800, Linus Torvalds wrote:

> The whole "fail security escalations" thing goes way beyond just
> filtering, I think we could seriously try to make it a generic
> feature.

After I wrote this comment I thought the same thing. It would be nice to
have a way to just set a flag to a process that will prevent it from
doing any escalating of privileges.

I totally agree, this would solve a whole host of issues with regard to
security issues in things that shouldn't be a problem but currently are.

Randy Dunlap

unread,
Jan 12, 2012, 12:20:02 PM1/12/12
to
On 01/11/2012 03:19 PM, Will Drewry wrote:
> Document how system call filtering with BPF works and
> may be used. Includes an example for x86 (32-bit).

Please tell some of us what "BPF" means. wikipedia lists 15 possible
choices, but I don't know which one to choose.

> Signed-off-by: Will Drewry <w...@chromium.org>
> ---
> Documentation/prctl/seccomp_filter.txt | 99 ++++++++++++++++++++++++++++++++
> samples/Makefile | 2 +-
> samples/seccomp/Makefile | 12 ++++
> samples/seccomp/bpf-example.c | 74 ++++++++++++++++++++++++
> 4 files changed, 186 insertions(+), 1 deletions(-)
> create mode 100644 Documentation/prctl/seccomp_filter.txt
> create mode 100644 samples/seccomp/Makefile
> create mode 100644 samples/seccomp/bpf-example.c


--
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

Linus Torvalds

unread,
Jan 12, 2012, 12:20:02 PM1/12/12
to
On Thu, Jan 12, 2012 at 8:27 AM, Steven Rostedt <ros...@goodmis.org> wrote:
>
> In that case, just have execv fail if filtering is enabled and we are
> execing a setuid program. But I don't see why non "magical" execv's
> should be prohibited.

The whole "fail security escalations" thing goes way beyond just
filtering, I think we could seriously try to make it a generic
feature.

For example, somebody just asked me the other day why "chroot()"
requires admin privileges, since it would be good to limit even
non-root things.

And it's really the exact same issue as filtering: in some sense,
chroot() "filters" FS name lookups, and can be used to fool programs
that are written to be secure.

We could easily introduce a per-process flag that just says "cannot
escalate privileges". Which basically just disables execve() of
suid/sgid programs (and possibly other things too), and locks the
process to the current privileges. And then make the rule be that *if*
that flag is set, you can then filter across an execve, or chroot as a
normal user, or whatever.

There are probably other things like that - things like allowing users
to do bind mounts etc - that aren't dangerous in themselves, but that
are dangerous mainly because they can be used to fool things into
privilege escalations. So this is definitely not a filter-only issue.

Linus

Will Drewry

unread,
Jan 12, 2012, 12:30:01 PM1/12/12
to
On Thu, Jan 12, 2012 at 12:16 PM, Randy Dunlap <rdu...@xenotime.net> wrote:
> On 01/11/2012 03:19 PM, Will Drewry wrote:
>> Document how system call filtering with BPF works and
>> may be used.  Includes an example for x86 (32-bit).
>
> Please tell some of us what "BPF" means.  wikipedia lists 15 possible
> choices, but I don't know which one to choose.

I'll make it clearer in the documentation file and update the patch description.

BPF == Berkeley Packet Filters which are implemented in Linux Socket
Filters (LSF)>

thanks!

Will Drewry

unread,
Jan 12, 2012, 12:30:01 PM1/12/12
to
On Thu, Jan 12, 2012 at 7:13 AM, Łukasz Sowa <luk...@gmail.com> wrote:
> Hi Will,
>
> That's very different approach to the system call interposition problem.
> I find you solution very interesting. It gives far more capabilities
> than my syscalls cgroup that you commented on some time ago. It's ready
> now but I haven't tried filtering yet. I think that if your solution
> make it to the mainline (and I guess that's really possible at current
> stage :)), there will be no place for mine solution but that's ok.

Yeah - there've been so many tries, I'll be happy when one makes it in
which is usable :)

> There's one thing that I'm curious about - have you measured overhead in
> any way? That was one of the biggest issues in all previous attempts to
> limit syscalls. I'd love to compare the numbers with mine solution.

Certainly. I have some rough numbers, but nothing I'd call strong
measurements. There is still a fair amount of cost due to the syscall
slow path.

> I'll examine your patch later on and put some comments if I bump into
> something.

Much appreciated - cheers!
will

Jamie Lokier

unread,
Jan 12, 2012, 12:40:01 PM1/12/12
to
Steven Rostedt wrote:
> On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
>
> > Filter programs may _only_ cross the execve(2) barrier if last filter
> > program was attached by a task with CAP_SYS_ADMIN capabilities in its
> > user namespace. Once a task-local filter program is attached from a
> > process without privileges, execve will fail. This ensures that only
> > privileged parent task can affect its privileged children (e.g., setuid
> > binary).
>
> This means that a non privileged user can not run another program with
> limited features? How would a process exec another program and filter
> it? I would assume that the filter would need to be attached first and
> then the execv() would be performed. But after the filter is attached,
> the execv is prevented?

Ugly method: Using ptrace(), trap after the execve() and issue fake
syscalls to install the filter. I feel dirty thinking it, in a good way.

LD_PRELOAD has been suggested. It's not 100% reliable because not all
executables are dynamic (on some uClinux platforms none of them are),
but it will usually work.

-- Jamie

Will Drewry

unread,
Jan 12, 2012, 12:40:01 PM1/12/12
to
On Thu, Jan 12, 2012 at 11:22 AM, Jamie Lokier <ja...@shareable.org> wrote:
> Will Drewry wrote:
>> On Thu, Jan 12, 2012 at 9:43 AM, Steven Rostedt <ros...@goodmis.org> wrote:
>> > On Wed, 2012-01-11 at 11:25 -0600, Will Drewry wrote:
>> >
>> >> Filter programs may _only_ cross the execve(2) barrier if last filter
>> >> program was attached by a task with CAP_SYS_ADMIN capabilities in its
>> >> user namespace.  Once a task-local filter program is attached from a
>> >> process without privileges, execve will fail.  This ensures that only
>> >> privileged parent task can affect its privileged children (e.g., setuid
>> >> binary).
>> >
>> > This means that a non privileged user can not run another program with
>> > limited features? How would a process exec another program and filter
>> > it? I would assume that the filter would need to be attached first and
>> > then the execv() would be performed. But after the filter is attached,
>> > the execv is prevented?
>>
>> Yeah - it means tasks can filter themselves, but not each other.
>> However, you can inject a filter for any dynamically linked executable
>> using LD_PRELOAD.
>>
>> > Maybe I don't understand this correctly.
>>
>> You're right on.  This was to ensure that one process didn't cause
>> crazy behavior in another. I think Alan has a better proposal than
>> mine below.  (Goes back to catching up.)
>
> You can already use ptrace() to cause crazy behaviour in another
> process, including modifying registers arbitrarily at syscall entry
> and exit, aborting and emulating syscalls.
>
> ptrace() is quite slow and it would be really nice to speed it up,
> especially for trapping a small subset of syscalls, or limiting some
> kinds of access to some file descriptors, while everything else runs
> at normal speed.
>
> Speeding up ptrace() with BPF filters would be a really nice.  Not
> that I like ptrace(), but sometimes it's the only thing you can rely on.
>
> LD_PRELOAD and code running in the target process address space can't
> always be trusted in some contexts (e.g. the target process may modify
> the tracing code or its data); whereas ptrace() is pretty complete and
> reliable, if ugly.
>
> There's already a security model around who can use ptrace(); speeding
> it up needn't break that.
>
> If we'd had BPF ptrace in the first place, SECCOMP wouldn't have been
> needed as userspace could have done it, with exactly the restrictions
> it wants.  Google's NaCl comes to mind as a potential user.

That's not entirely true. ptrace supervisors are subject to races and
always fail open. This makes them effective but not as robust as a
seccomp solution can provide.

With seccomp, it fails close. What I think would make sense would be
to add a user-controllable failure mode with seccomp bpf that calls
tracehook_ptrace_syscall_entry(regs). I've prototyped this and it
works quite well, but I didn't want to conflate the discussions.

Using ptrace() would also mean that all consumers of this interface
would need a supervisor, but with seccomp, the filters are installed
and require no supervisors to stick around for when failure occurs.

Does that make sense?
thanks!
will

Steven Rostedt

unread,
Jan 12, 2012, 12:40:01 PM1/12/12
to
On Thu, 2012-01-12 at 11:23 -0600, Will Drewry wrote:

> > Please tell some of us what "BPF" means. wikipedia lists 15 possible
> > choices, but I don't know which one to choose.
>
> I'll make it clearer in the documentation file and update the patch description.
>
> BPF == Berkeley Packet Filters which are implemented in Linux Socket
> Filters (LSF)>
>

I admit, I was totally clueless in what it meant too ;)

Even the LWN article didn't explain (shame on you Jon).

"he has repurposed the networking layer's packet filtering mechanism
(BPF)"

I didn't know what did the "B" stood for.

-- Steve

Jamie Lokier

unread,
Jan 12, 2012, 12:40:01 PM1/12/12
to
Steven Rostedt wrote:
> On Thu, 2012-01-12 at 17:14 +0100, Oleg Nesterov wrote:
>
> > May be this needs something like LSM_UNSAFE_SECCOMP, or perhaps
> > cap_bprm_set_creds() should take seccomp.mode == 2 into account, I dunno.
> >
> > OTOH, currently seccomp.mode == 1 doesn't allow to exec at all.
>
> I've never used seccomp, so I admit I'm totally ignorant on this topic.
>
> But looking at seccomp from the outside, the biggest advantage to this
> would be the ability for normal processes to be able to limit tasks it
> kicks off. If I want to run a task in a sandbox, I don't want to be root
> to do so.
>
> I guess a web browser doesn't perform an exec to run java programs.

Actually it does. Firefox on Linux forks and execs the Java VM.
Same for Flash, using "plugin-container".

> But it would be nice if I could execute something from the command
> line that I could run in a sand box.

You can do this now, using ptrace(). It's horrible, but half of the
horribleness is needing to understand machine-dependent registers,
which this new patch doesn't address. (The other half is a ton of
undocumented but important ptrace() behaviours on Linux.)

-- Jamie

Oleg Nesterov

unread,
Jan 12, 2012, 12:40:02 PM1/12/12
to
On 01/12, Will Drewry wrote:
>
> On Thu, Jan 12, 2012 at 10:22 AM, Oleg Nesterov <ol...@redhat.com> wrote:
> >> +      */
> >> +     regs = seccomp_get_regs(regs_tmp, &regs_size);
> >
> > Stupid question. I am sure you know what are you doing ;) and I know
> > nothing about !x86 arches.
> >
> > But could you explain why it is designed to use user_regs_struct ?
> > Why we can't simply use task_pt_regs() and avoid the (costly) regsets?
>
> So on x86 32, it would work since user_regs_struct == task_pt_regs
> (iirc), but on x86-64
> and others, that's not true.

Yes sure, I meant that userpace should use pt_regs too.

> If it would be appropriate to expose pt_regs to userspace, then I'd
> happily do so :)

Ah, so that was the reason. But it is already exported? At least I see
the "#ifndef __KERNEL__" definition in arch/x86/include/asm/ptrace.h.

Once again, I am not arguing, just trying to understand. And I do not
know if this definition is part of abi.

Oleg.

Jamie Lokier

unread,
Jan 12, 2012, 12:50:01 PM1/12/12
to
Steven Rostedt wrote:
> On Thu, 2012-01-12 at 17:30 +0000, Jamie Lokier wrote:
>
> > You can do this now, using ptrace(). It's horrible, but half of the
> > horribleness is needing to understand machine-dependent registers,
> > which this new patch doesn't address. (The other half is a ton of
> > undocumented but important ptrace() behaviours on Linux.)
>
> Yeah I know the horrid use of ptrace, I've implemented programs that use
> it :-p

That warm fuzzy feeling :-)

> I guess ptrace can capture the execv and determine if it is OK or not to
> run it. But again, this doesn't stop the possible attacks that could
> happen, with having the execv on a symlink file, having the ptrace check
> say its OK, and then switching the symlink to a setuid file.
>
> When the new execv executed, the parent process would lose all control
> over it. The idea is to prevent this.

fexecve() exists to solve the problem.
Also known as execve("/proc/self/fd/...") on Linux.

> I like Alan's suggestion. Have userspace decide to allow execv or not,
> and even let it decide if it should allow setuid execv's or not, but
> still allow non-setuid execvs. If you allow the setuid execv, once that
> happens, the same behavior will occur as with ptrace. A setuid execv
> will lose all its filtering.

I like the idea of letting the tracer decide what it wants.

Steven Rostedt

unread,
Jan 12, 2012, 12:50:01 PM1/12/12
to
On Thu, 2012-01-12 at 17:30 +0000, Jamie Lokier wrote:

> You can do this now, using ptrace(). It's horrible, but half of the
> horribleness is needing to understand machine-dependent registers,
> which this new patch doesn't address. (The other half is a ton of
> undocumented but important ptrace() behaviours on Linux.)

Yeah I know the horrid use of ptrace, I've implemented programs that use
it :-p

I guess ptrace can capture the execv and determine if it is OK or not to
run it. But again, this doesn't stop the possible attacks that could
happen, with having the execv on a symlink file, having the ptrace check
say its OK, and then switching the symlink to a setuid file.

When the new execv executed, the parent process would lose all control
over it. The idea is to prevent this.

I like Alan's suggestion. Have userspace decide to allow execv or not,
and even let it decide if it sh