Signed-off-by: Will Drewry <w...@chromium.org>
---
Documentation/prctl/seccomp_filter.txt | 159 ++++++++++++++++++++++++++++++++
1 files changed, 159 insertions(+), 0 deletions(-)
create mode 100644 Documentation/prctl/seccomp_filter.txt
diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
new file mode 100644
index 0000000..5fb3f44
--- /dev/null
+++ b/Documentation/prctl/seccomp_filter.txt
@@ -0,0 +1,159 @@
+ Seccomp filtering
+ =================
+
+Introduction
+------------
+
+A large number of system calls are exposed to every userland process
+with many of them going unused for the entire lifetime of the process.
+As system calls change and mature, bugs are found and eradicated. A
+certain subset of userland applications benefit by having a reduced set
+of available system calls. The resulting set reduces the total kernel
+surface exposed to the application. System call filtering is meant for
+use with those applications.
+
+Seccomp filtering provides a means for a process to specify a filter
+for incoming system calls. The filter is expressed as a Berkeley Packet
+Filter program, as with socket filters, except that the data operated on
+is the current user_regs_struct. This allows for expressive filtering
+of system calls using the pre-existing system call ABI and using a filter
+program language with a long history of being exposed to userland.
+Additionally, BPF makes it impossible for users of seccomp to fall prey to
+time-of-check-time-of-use (TOCTOU) attacks that are common in system call
+interposition frameworks because the evaluated data is solely register state
+just after system call entry.
+
+What it isn't
+-------------
+
+System call filtering isn't a sandbox. It provides a clearly defined
+mechanism for minimizing the exposed kernel surface. Beyond that,
+policy for logical behavior and information flow should be managed with
+a combinations of other system hardening techniques and, potentially, a
+LSM of your choosing. Expressive, dynamic filters provide further options down
+this path (avoiding pathological sizes or selecting which of the multiplexed
+system calls in socketcall() is allowed, for instance) which could be
+construed, incorrectly, as a more complete sandboxing solution.
+
+Usage
+-----
+
+An additional seccomp mode is added, but they are not directly set by the
+consuming process. The new mode, '2', is only available if
+CONFIG_SECCOMP_FILTER is set and enabled using prctl with the
+PR_ATTACH_SECCOMP_FILTER argument.
+
+Interacting with seccomp filters is done using one prctl(2) call.
+
+PR_ATTACH_SECCOMP_FILTER:
+ Allows the specification of a new filter using a BPF program.
+ The BPF program will be executed over a user_regs_struct data
+ reflecting system call time except with the system call number
+ resident in orig_[register]. To allow a system call, the size
+ of the data must be returned. At present, all other return values
+ result in the system call being blocked, but it is recommended to
+ return 0 in those cases. This will allow for future custom return
+ values to be introduced, if ever desired.
+
+ Usage:
+ prctl(PR_ATTACH_SECCOMP_FILTER, prog);
+
+ The 'prog' argument is a pointer to a struct sock_fprog which will
+ contain the filter program. If the program is invalid, the call
+ will return -1 and set errno to -EINVAL.
+
+ The struct user_regs_struct the @prog will see is based on the
+ personality of the task at the time of this prctl call. Additionally,
+ is_compat_task is also tracked for the @prog. This means that once set
+ the calling task will have all of its system calls blocked if it
+ switches its system call ABI (via personality or other means).
+
+ If the @prog is installed while the task has CAP_SYS_ADMIN in its user
+ namespace, the @prog will be marked as inheritable across execve. Any
+ inherited filters are still subject to the system call ABI constraints
+ above and any ABI mismatched system calls will result in process death.
+
+All of the above calls return 0 on success and non-zero on error.
+
+
+Example
+-------
+
+Assume a process would like to cleanly read and write to stdin/out/err and exit
+cleanly. Without using a BPF compiler, it may be done as follows on x86 32-bit:
+
+#include <asm/unistd.h>
+#include <linux/filter.h>
+#include <stdio.h>
+#include <stddef.h>
+#include <sys/user.h>
+#include <unistd.h>
+
+#define regoffset(_reg) (offsetof(struct user_regs_struct, _reg))
+int install_filter(void)
+{
+ struct sock_filter filter[] = {
+ /* Grab the system call number */
+ BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(orig_eax)),
+ /* Jump table for the allowed syscalls */
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_rt_sigreturn, 10, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_sigreturn, 9, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit_group, 8, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit, 7, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_read, 1, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_write, 2, 6),
+
+ /* Check that read is only using stdin. */
+ BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(ebx)),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDIN_FILENO, 3, 4),
+
+ /* Check that write is only using stdout/stderr */
+ BPF_STMT(BPF_LD+BPF_W+BPF_IND, regoffset(ebx)),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDOUT_FILENO, 1, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDERR_FILENO, 0, 1),
+
+ /* Put the "accept" value in A */
+ BPF_STMT(BPF_LD+BPF_W+BPF_LEN, 0),
+
+ BPF_STMT(BPF_RET+BPF_A,0),
+ };
+ struct sock_fprog prog = {
+ .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+ .filter = filter,
+ };
+ if (prctl(36, &prog)) {
+ perror("prctl");
+ return 1;
+ }
+ return 0;
+}
+
+#define payload(_c) _c, sizeof(_c)
+int main(int argc, char **argv) {
+ char buf[4096];
+ ssize_t bytes = 0;
+ if (install_filter())
+ return 1;
+ syscall(__NR_write, STDOUT_FILENO, payload("OHAI! WHAT IS YOUR NAME? "));
+ bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf));
+ syscall(__NR_write, STDOUT_FILENO, payload("HELLO, "));
+ syscall(__NR_write, STDOUT_FILENO, buf, bytes);
+ return 0;
+}
+
+Additionally, if prctl(2) is allowed by the installed filter, additional
+filters may be layered on which will increase evaluation time, but allow for
+further decreasing the attack surface during execution of a process.
+
+
+Caveats
+-------
+
+- execve will fail unless the most recently attached filter was installed by
+ a process with CAP_SYS_ADMIN (in its namespace).
+
+Adding architecture support
+-----------------------
+
+Any platform with seccomp support will support seccomp filters
+as long as CONFIG_SECCOMP_FILTER is enabled.
--
1.7.5.4
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
A filter program may be installed by a userland task by calling
prctl(PR_ATTACH_SECCOMP_FILTER, &fprog);
where fprog is of type struct sock_fprog.
If the first filter program allows subsequent prctl(2) calls, then
additional filter programs may be attached. All attached programs
must be evaluated before a system call will be allowed to proceed.
To avoid CONFIG_COMPAT related landmines, once a filter program is
installed using specific is_compat_task() and current->personality, it
is not allowed to make system calls or attach additional filters which
use a different combination of is_compat_task() and
current->personality.
Filter programs may _only_ cross the execve(2) barrier if last filter
program was attached by a task with CAP_SYS_ADMIN capabilities in its
user namespace. Once a task-local filter program is attached from a
process without privileges, execve will fail. This ensures that only
privileged parent task can affect its privileged children (e.g., setuid
binary).
There are a number of benefits to this approach. A few of which are
as follows:
- BPF has been exposed to userland for a long time.
- Userland already knows its ABI: expected register layout and system
call numbers.
- Full register information is provided which may be relevant for
certain syscalls (fork, rt_sigreturn) or for other userland
filtering tactics (checking the PC).
- No time-of-check-time-of-use vulnerable data accesses are possible.
This patch includes its own BPF evaluator, but relies on the
net/core/filter.c BPF checking code. It is possible to share
evaluators, but the performance sensitive nature of the network
filtering path makes it an iterative optimization which (I think :) can
be tackled separately via separate patchsets. (And at some point sharing
BPF JIT code!)
Signed-off-by: Will Drewry <w...@chromium.org>
---
fs/exec.c | 5 +
include/linux/prctl.h | 3 +
include/linux/seccomp.h | 70 +++++-
kernel/Makefile | 1 +
kernel/fork.c | 4 +
kernel/seccomp.c | 8 +
kernel/seccomp_filter.c | 639 +++++++++++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 4 +
security/Kconfig | 12 +
9 files changed, 743 insertions(+), 3 deletions(-)
create mode 100644 kernel/seccomp_filter.c
diff --git a/fs/exec.c b/fs/exec.c
index 3625464..e9cc89c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -44,6 +44,7 @@
#include <linux/namei.h>
#include <linux/mount.h>
#include <linux/security.h>
+#include <linux/seccomp.h>
#include <linux/syscalls.h>
#include <linux/tsacct_kern.h>
#include <linux/cn_proc.h>
@@ -1477,6 +1478,10 @@ static int do_execve_common(const char *filename,
if (retval)
goto out_ret;
+ retval = seccomp_check_exec();
+ if (retval)
+ goto out_ret;
+
retval = -ENOMEM;
bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
if (!bprm)
diff --git a/include/linux/prctl.h b/include/linux/prctl.h
index a3baeb2..15e2460 100644
--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -64,6 +64,9 @@
#define PR_GET_SECCOMP 21
#define PR_SET_SECCOMP 22
+/* Set process seccomp filters */
+#define PR_ATTACH_SECCOMP_FILTER 36
+
/* Get/set the capability bounding set (as per security/commoncap.c) */
#define PR_CAPBSET_READ 23
#define PR_CAPBSET_DROP 24
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index cc7a4e9..99d163e 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -5,9 +5,28 @@
#ifdef CONFIG_SECCOMP
#include <linux/thread_info.h>
+#include <linux/types.h>
#include <asm/seccomp.h>
-typedef struct { int mode; } seccomp_t;
+struct seccomp_filter;
+/**
+ * struct seccomp_struct - the state of a seccomp'ed process
+ *
+ * @mode:
+ * if this is 0, seccomp is not in use.
+ * is 1, the process is under standard seccomp rules.
+ * is 2, the process is only allowed to make system calls where
+ * associated filters evaluate successfully.
+ * @filter: Metadata for filter if using CONFIG_SECCOMP_FILTER.
+ * @filter must only be accessed from the context of current as there
+ * is no guard.
+ */
+typedef struct seccomp_struct {
+ int mode;
+#ifdef CONFIG_SECCOMP_FILTER
+ struct seccomp_filter *filter;
+#endif
+} seccomp_t;
extern void __secure_computing(int);
static inline void secure_computing(int this_syscall)
@@ -28,8 +47,7 @@ static inline int seccomp_mode(seccomp_t *s)
#include <linux/errno.h>
-typedef struct { } seccomp_t;
-
+typedef struct seccomp_struct { } seccomp_t;
#define secure_computing(x) do { } while (0)
static inline long prctl_get_seccomp(void)
@@ -49,4 +67,50 @@ static inline int seccomp_mode(seccomp_t *s)
#endif /* CONFIG_SECCOMP */
+#ifdef CONFIG_SECCOMP_FILTER
+
+#define seccomp_filter_init_task(_tsk) do { \
+ (_tsk)->seccomp.filter = NULL; \
+} while (0);
+
+/* No locking is needed here because the task_struct will
+ * have no parallel consumers.
+ */
+#define seccomp_filter_free_task(_tsk) do { \
+ put_seccomp_filter((_tsk)->seccomp.filter); \
+} while (0);
+
+extern int seccomp_check_exec(void);
+
+extern long prctl_attach_seccomp_filter(char __user *);
+
+extern struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *);
+extern void put_seccomp_filter(struct seccomp_filter *);
+
+extern int seccomp_test_filters(int);
+extern void seccomp_filter_log_failure(int);
+extern void seccomp_filter_fork(struct task_struct *child,
+ struct task_struct *parent);
+
+#else /* CONFIG_SECCOMP_FILTER */
+
+#include <linux/errno.h>
+
+struct seccomp_filter { };
+#define seccomp_filter_init_task(_tsk) do { } while (0);
+#define seccomp_filter_fork(_tsk, _orig) do { } while (0);
+#define seccomp_filter_free_task(_tsk) do { } while (0);
+
+static inline int seccomp_check_exec(void)
+{
+ return 0;
+}
+
+
+static inline long prctl_attach_seccomp_filter(char __user *a2)
+{
+ return -ENOSYS;
+}
+
+#endif /* CONFIG_SECCOMP_FILTER */
#endif /* _LINUX_SECCOMP_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index e898c5b..0584090 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -79,6 +79,7 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
+obj-$(CONFIG_SECCOMP_FILTER) += seccomp_filter.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
obj-$(CONFIG_TREE_RCU) += rcutree.o
obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
diff --git a/kernel/fork.c b/kernel/fork.c
index da4a6a1..cc1d628 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -34,6 +34,7 @@
#include <linux/cgroup.h>
#include <linux/security.h>
#include <linux/hugetlb.h>
+#include <linux/seccomp.h>
#include <linux/swap.h>
#include <linux/syscalls.h>
#include <linux/jiffies.h>
@@ -166,6 +167,7 @@ void free_task(struct task_struct *tsk)
free_thread_info(tsk->stack);
rt_mutex_debug_task_free(tsk);
ftrace_graph_exit_task(tsk);
+ seccomp_filter_free_task(tsk);
free_task_struct(tsk);
}
EXPORT_SYMBOL(free_task);
@@ -1209,6 +1211,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
/* Perform scheduler related setup. Assign this task to a CPU. */
sched_fork(p);
+ seccomp_filter_init_task(p);
retval = perf_event_init_task(p);
if (retval)
goto bad_fork_cleanup_policy;
@@ -1375,6 +1378,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
if (clone_flags & CLONE_THREAD)
threadgroup_fork_read_unlock(current);
perf_event_fork(p);
+ seccomp_filter_fork(p, current);
return p;
bad_fork_free_pid:
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 57d4b13..78719be 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -47,6 +47,14 @@ void __secure_computing(int this_syscall)
return;
} while (*++syscall);
break;
+#ifdef CONFIG_SECCOMP_FILTER
+ case 2:
+ if (seccomp_test_filters(this_syscall) == 0)
+ return;
+
+ seccomp_filter_log_failure(this_syscall);
+ break;
+#endif
default:
BUG();
}
diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
new file mode 100644
index 0000000..4770847
--- /dev/null
+++ b/kernel/seccomp_filter.c
@@ -0,0 +1,639 @@
+/* bpf program-based system call filtering
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) 2011 The Chromium OS Authors <chromiu...@chromium.org>
+ */
+
+#include <linux/capability.h>
+#include <linux/compat.h>
+#include <linux/err.h>
+#include <linux/errno.h>
+#include <linux/rculist.h>
+#include <linux/filter.h>
+#include <linux/kallsyms.h>
+#include <linux/kref.h>
+#include <linux/module.h>
+#include <linux/pid.h>
+#include <linux/prctl.h>
+#include <linux/ptrace.h>
+#include <linux/ratelimit.h>
+#include <linux/reciprocal_div.h>
+#include <linux/regset.h>
+#include <linux/seccomp.h>
+#include <linux/security.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/user.h>
+
+
+/**
+ * struct seccomp_filter - container for seccomp BPF programs
+ *
+ * @usage: reference count to manage the object lifetime.
+ * get/put helpers should be used when accessing an instance
+ * outside of a lifetime-guarded section. In general, this
+ * is only needed for handling filters shared across tasks.
+ * @creator: pointer to the pid that created this filter
+ * @parent: pointer to the ancestor which this filter will be composed with.
+ * @flags: provide information about filter from creation time.
+ * @personality: personality of the process at filter creation time.
+ * @insns: the BPF program instructions to evaluate
+ * @count: the number of instructions in the program.
+ *
+ * seccomp_filter objects should never be modified after being attached
+ * to a task_struct (other than @usage).
+ */
+struct seccomp_filter {
+ struct kref usage;
+ struct pid *creator;
+ struct seccomp_filter *parent;
+ struct {
+ uint32_t admin:1, /* can allow execve */
+ compat:1, /* CONFIG_COMPAT */
+ __reserved:30;
+ } flags;
+ int personality;
+ unsigned short count; /* Instruction count */
+ struct sock_filter insns[0];
+};
+
+static unsigned int seccomp_run_filter(const u8 *buf,
+ const size_t buflen,
+ const struct sock_filter *);
+
+/**
+ * seccomp_filter_alloc - allocates a new filter object
+ * @padding: size of the insns[0] array in bytes
+ *
+ * The @padding should be a multiple of
+ * sizeof(struct sock_filter).
+ *
+ * Returns ERR_PTR on error or an allocated object.
+ */
+static struct seccomp_filter *seccomp_filter_alloc(unsigned long padding)
+{
+ struct seccomp_filter *f;
+ unsigned long bpf_blocks = padding / sizeof(struct sock_filter);
+
+ /* Drop oversized requests. */
+ if (bpf_blocks == 0 || bpf_blocks > BPF_MAXINSNS)
+ return ERR_PTR(-EINVAL);
+
+ /* Padding should always be in sock_filter increments. */
+ BUG_ON(padding % sizeof(struct sock_filter));
+
+ f = kzalloc(sizeof(struct seccomp_filter) + padding, GFP_KERNEL);
+ if (!f)
+ return ERR_PTR(-ENOMEM);
+ kref_init(&f->usage);
+ f->creator = get_task_pid(current, PIDTYPE_PID);
+ f->count = bpf_blocks;
+ return f;
+}
+
+/**
+ * seccomp_filter_free - frees the allocated filter.
+ * @filter: NULL or live object to be completely destructed.
+ */
+static void seccomp_filter_free(struct seccomp_filter *filter)
+{
+ if (!filter)
+ return;
+ put_seccomp_filter(filter->parent);
+ put_pid(filter->creator);
+ kfree(filter);
+}
+
+static void __put_seccomp_filter(struct kref *kref)
+{
+ struct seccomp_filter *orig =
+ container_of(kref, struct seccomp_filter, usage);
+ seccomp_filter_free(orig);
+}
+
+void seccomp_filter_log_failure(int syscall)
+{
+ pr_info("%s[%d]: system call %d blocked at 0x%lx\n",
+ current->comm, task_pid_nr(current), syscall,
+ KSTK_EIP(current));
+}
+
+/* put_seccomp_filter - decrements the ref count of @orig and may free. */
+void put_seccomp_filter(struct seccomp_filter *orig)
+{
+ if (!orig)
+ return;
+ kref_put(&orig->usage, __put_seccomp_filter);
+}
+
+/* get_seccomp_filter - increments the reference count of @orig. */
+struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *orig)
+{
+ if (!orig)
+ return NULL;
+ kref_get(&orig->usage);
+ return orig;
+}
+
+static int seccomp_check_personality(struct seccomp_filter *filter)
+{
+ if (filter->personality != current->personality)
+ return -EACCES;
+#ifdef CONFIG_COMPAT
+ if (filter->flags.compat != (!!(is_compat_task())))
+ return -EACCES;
+#endif
+ return 0;
+}
+
+static const struct user_regset *
+find_prstatus(const struct user_regset_view *view)
+{
+ const struct user_regset *regset;
+ int n;
+
+ /* Skip 0. */
+ for (n = 1; n < view->n; ++n) {
+ regset = view->regsets + n;
+ if (regset->core_note_type == NT_PRSTATUS)
+ return regset;
+ }
+
+ return NULL;
+}
+
+/**
+ * seccomp_get_regs - returns a pointer to struct user_regs_struct
+ * @scratch: preallocated storage of size @available
+ * @available: pointer to the size of scratch.
+ *
+ * Returns NULL if the registers cannot be acquired or copied.
+ * Returns a populated pointer to @scratch by default.
+ * Otherwise, returns a pointer to a a u8 array containing the struct
+ * user_regs_struct appropriate for the task personality. The pointer
+ * may be to the beginning of @scratch or to an externally managed data
+ * structure. On success, @available should be updated with the
+ * valid region size of the returned pointer.
+ *
+ * If the architecture overrides the linkage, then the pointer may pointer to
+ * another location.
+ */
+__weak u8 *seccomp_get_regs(u8 *scratch, size_t *available)
+{
+ /* regset is usually returned based on task personality, not current
+ * system call convention. This behavior makes it unsafe to execute
+ * BPF programs over regviews if is_compat_task or the personality
+ * have changed since the program was installed.
+ */
+ const struct user_regset_view *view = task_user_regset_view(current);
+ const struct user_regset *regset = &view->regsets[0];
+ size_t scratch_size = *available;
+ if (regset->core_note_type != NT_PRSTATUS) {
+ /* The architecture should override this method for speed. */
+ regset = find_prstatus(view);
+ if (!regset)
+ return NULL;
+ }
+ *available = regset->n * regset->size;
+ /* Make sure the scratch space isn't exceeded. */
+ if (*available > scratch_size)
+ *available = scratch_size;
+ if (regset->get(current, regset, 0, *available, scratch, NULL))
+ return NULL;
+ return scratch;
+}
+
+/**
+ * seccomp_test_filters - tests 'current' against the given syscall
+ * @syscall: number of the system call to test
+ *
+ * Returns 0 on ok and non-zero on error/failure.
+ */
+int seccomp_test_filters(int syscall)
+{
+ struct seccomp_filter *filter;
+ u8 regs_tmp[sizeof(struct user_regs_struct)], *regs;
+ size_t regs_size = sizeof(struct user_regs_struct);
+ int ret = -EACCES;
+
+ filter = current->seccomp.filter; /* uses task ref */
+ if (!filter)
+ goto out;
+
+ /* All filters in the list are required to share the same system call
+ * convention so only the first filter is ever checked.
+ */
+ if (seccomp_check_personality(filter))
+ goto out;
+
+ /* Grab the user_regs_struct. Normally, regs == ®s_tmp, but
+ * that is not mandatory. E.g., it may return a point to
+ * task_pt_regs(current). NULL checking is mandatory.
+ */
+ regs = seccomp_get_regs(regs_tmp, ®s_size);
+ if (!regs)
+ goto out;
+
+ /* Only allow a system call if it is allowed in all ancestors. */
+ ret = 0;
+ for ( ; filter != NULL; filter = filter->parent) {
+ /* Allowed if return value is the size of the data supplied. */
+ if (seccomp_run_filter(regs, regs_size, filter->insns) !=
+ regs_size)
+ ret = -EACCES;
+ }
+out:
+ return ret;
+}
+
+/**
+ * seccomp_attach_filter: Attaches a seccomp filter to current.
+ * @fprog: BPF program to install
+ *
+ * Context: User context only. This function may sleep on allocation and
+ * operates on current. current must be attempting a system call
+ * when this is called (usually prctl).
+ *
+ * This function may be called repeatedly to install additional filters.
+ * Every filter successfully installed will be evaluated (in reverse order)
+ * for each system call the thread makes.
+ *
+ * Returns 0 on success or an errno on failure.
+ */
+long seccomp_attach_filter(struct sock_fprog *fprog)
+{
+ struct seccomp_filter *filter = NULL;
+ /* Note, len is a short so overflow should be impossible. */
+ unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
+ long ret = -EPERM;
+
+ /* Allocate a new seccomp_filter */
+ filter = seccomp_filter_alloc(fp_size);
+ if (IS_ERR(filter)) {
+ ret = PTR_ERR(filter);
+ goto out;
+ }
+
+ /* Lock the process personality and calling convention. */
+#ifdef CONFIG_COMPAT
+ if (is_compat_task())
+ filter->flags.compat = 1;
+#endif
+ filter->personality = current->personality;
+
+ /* Auditing is not needed since the capability wasn't requested */
+ if (security_real_capable_noaudit(current, current_user_ns(),
+ CAP_SYS_ADMIN) == 0)
+ filter->flags.admin = 1;
+
+ /* Copy the instructions from fprog. */
+ ret = -EFAULT;
+ if (copy_from_user(filter->insns, fprog->filter, fp_size))
+ goto out;
+
+ /* Check the fprog */
+ ret = sk_chk_filter(filter->insns, filter->count);
+ if (ret)
+ goto out;
+
+ /* If there is an existing filter, make it the parent
+ * and reuse the existing task-based ref.
+ */
+ filter->parent = current->seccomp.filter;
+
+ /* Force all filters to use one system call convention. */
+ ret = -EINVAL;
+ if (filter->parent) {
+ if (filter->parent->flags.compat != filter->flags.compat)
+ goto out;
+ if (filter->parent->personality != filter->personality)
+ goto out;
+ }
+
+ /* Double claim the new filter so we can release it below simplifying
+ * the error paths earlier.
+ */
+ ret = 0;
+ get_seccomp_filter(filter);
+ current->seccomp.filter = filter;
+ /* Engage seccomp if it wasn't. This doesn't use PR_SET_SECCOMP. */
+ if (!current->seccomp.mode) {
+ current->seccomp.mode = 2;
+ set_thread_flag(TIF_SECCOMP);
+ }
+
+out:
+ put_seccomp_filter(filter); /* for get or task, on err */
+ return ret;
+}
+
+long prctl_attach_seccomp_filter(char __user *user_filter)
+{
+ struct sock_fprog fprog;
+ long ret = -EINVAL;
+
+ ret = -EFAULT;
+ if (!user_filter)
+ goto out;
+
+ if (copy_from_user(&fprog, user_filter, sizeof(fprog)))
+ goto out;
+
+ ret = seccomp_attach_filter(&fprog);
+out:
+ return ret;
+}
+
+/**
+ * seccomp_check_exec: determines if exec is allowed for current
+ * Returns 0 if allowed.
+ */
+int seccomp_check_exec(void)
+{
+ if (current->seccomp.mode != 2)
+ return 0;
+ /* We can rely on the task refcount for the filter. */
+ if (!current->seccomp.filter)
+ return -EPERM;
+ /* The last attached filter set for the process is checked. It must
+ * have been installed with CAP_SYS_ADMIN capabilities.
+ */
+ if (current->seccomp.filter->flags.admin)
+ return 0;
+ return -EPERM;
+}
+
+/* seccomp_filter_fork: manages inheritance on fork
+ * @child: forkee
+ * @parent: forker
+ * Ensures that @child inherit a seccomp_filter iff seccomp is enabled
+ * and the set of filters is marked as 'enabled'.
+ */
+void seccomp_filter_fork(struct task_struct *child,
+ struct task_struct *parent)
+{
+ if (!parent->seccomp.mode)
+ return;
+ child->seccomp.mode = parent->seccomp.mode;
+ child->seccomp.filter = get_seccomp_filter(parent->seccomp.filter);
+}
+
+/* Returns a pointer to the BPF evaluator after checking the offset and size
+ * boundaries. The signature almost matches the signature from
+ * net/core/filter.c with the hopes of sharing code in the future.
+ */
+static const void *load_pointer(const u8 *buf, size_t buflen,
+ int offset, size_t size,
+ void *unused)
+{
+ if (offset >= buflen)
+ goto fail;
+ if (offset < 0)
+ goto fail;
+ if (size > buflen - offset)
+ goto fail;
+ return buf + offset;
+fail:
+ return NULL;
+}
+
+/**
+ * seccomp_run_filter - evaluate BPF (over user_regs_struct)
+ * @buf: buffer to execute the filter over
+ * @buflen: length of the buffer
+ * @fentry: filter to apply
+ *
+ * Decode and apply filter instructions to the buffer.
+ * Return length to keep, 0 for none. @buf is a regset we are
+ * filtering, @filter is the array of filter instructions.
+ * Because all jumps are guaranteed to be before last instruction,
+ * and last instruction guaranteed to be a RET, we dont need to check
+ * flen.
+ *
+ * See core/net/filter.c as this is nearly an exact copy.
+ * At some point, it would be nice to merge them to take advantage of
+ * optimizations (like JIT).
+ *
+ * A successful filter must return the full length of the data. Anything less
+ * will currently result in a seccomp failure. In the future, it may be
+ * possible to use that for hard filtering registers on the fly so it is
+ * ideal for consumers to return 0 on intended failure.
+ */
+static unsigned int seccomp_run_filter(const u8 *buf,
+ const size_t buflen,
+ const struct sock_filter *fentry)
+{
+ const void *ptr;
+ u32 A = 0; /* Accumulator */
+ u32 X = 0; /* Index Register */
+ u32 mem[BPF_MEMWORDS]; /* Scratch Memory Store */
+ u32 tmp;
+ int k;
+
+ /*
+ * Process array of filter instructions.
+ */
+ for (;; fentry++) {
+#if defined(CONFIG_X86_32)
+#define K (fentry->k)
+#else
+ const u32 K = fentry->k;
+#endif
+
+ switch (fentry->code) {
+ case BPF_S_ALU_ADD_X:
+ A += X;
+ continue;
+ case BPF_S_ALU_ADD_K:
+ A += K;
+ continue;
+ case BPF_S_ALU_SUB_X:
+ A -= X;
+ continue;
+ case BPF_S_ALU_SUB_K:
+ A -= K;
+ continue;
+ case BPF_S_ALU_MUL_X:
+ A *= X;
+ continue;
+ case BPF_S_ALU_MUL_K:
+ A *= K;
+ continue;
+ case BPF_S_ALU_DIV_X:
+ if (X == 0)
+ return 0;
+ A /= X;
+ continue;
+ case BPF_S_ALU_DIV_K:
+ A = reciprocal_divide(A, K);
+ continue;
+ case BPF_S_ALU_AND_X:
+ A &= X;
+ continue;
+ case BPF_S_ALU_AND_K:
+ A &= K;
+ continue;
+ case BPF_S_ALU_OR_X:
+ A |= X;
+ continue;
+ case BPF_S_ALU_OR_K:
+ A |= K;
+ continue;
+ case BPF_S_ALU_LSH_X:
+ A <<= X;
+ continue;
+ case BPF_S_ALU_LSH_K:
+ A <<= K;
+ continue;
+ case BPF_S_ALU_RSH_X:
+ A >>= X;
+ continue;
+ case BPF_S_ALU_RSH_K:
+ A >>= K;
+ continue;
+ case BPF_S_ALU_NEG:
+ A = -A;
+ continue;
+ case BPF_S_JMP_JA:
+ fentry += K;
+ continue;
+ case BPF_S_JMP_JGT_K:
+ fentry += (A > K) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JGE_K:
+ fentry += (A >= K) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JEQ_K:
+ fentry += (A == K) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JSET_K:
+ fentry += (A & K) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JGT_X:
+ fentry += (A > X) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JGE_X:
+ fentry += (A >= X) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JEQ_X:
+ fentry += (A == X) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_JMP_JSET_X:
+ fentry += (A & X) ? fentry->jt : fentry->jf;
+ continue;
+ case BPF_S_LD_W_ABS:
+ k = K;
+load_w:
+ ptr = load_pointer(buf, buflen, k, 4, &tmp);
+ if (ptr != NULL) {
+ /* Note, unlike on network data, values are not
+ * byte swapped.
+ */
+ A = *(const u32 *)ptr;
+ continue;
+ }
+ return 0;
+ case BPF_S_LD_H_ABS:
+ k = K;
+load_h:
+ ptr = load_pointer(buf, buflen, k, 2, &tmp);
+ if (ptr != NULL) {
+ A = *(const u16 *)ptr;
+ continue;
+ }
+ return 0;
+ case BPF_S_LD_B_ABS:
+ k = K;
+load_b:
+ ptr = load_pointer(buf, buflen, k, 1, &tmp);
+ if (ptr != NULL) {
+ A = *(const u8 *)ptr;
+ continue;
+ }
+ return 0;
+ case BPF_S_LD_W_LEN:
+ A = buflen;
+ continue;
+ case BPF_S_LDX_W_LEN:
+ X = buflen;
+ continue;
+ case BPF_S_LD_W_IND:
+ k = X + K;
+ goto load_w;
+ case BPF_S_LD_H_IND:
+ k = X + K;
+ goto load_h;
+ case BPF_S_LD_B_IND:
+ k = X + K;
+ goto load_b;
+ case BPF_S_LDX_B_MSH:
+ ptr = load_pointer(buf, buflen, K, 1, &tmp);
+ if (ptr != NULL) {
+ X = (*(u8 *)ptr & 0xf) << 2;
+ continue;
+ }
+ return 0;
+ case BPF_S_LD_IMM:
+ A = K;
+ continue;
+ case BPF_S_LDX_IMM:
+ X = K;
+ continue;
+ case BPF_S_LD_MEM:
+ A = mem[K];
+ continue;
+ case BPF_S_LDX_MEM:
+ X = mem[K];
+ continue;
+ case BPF_S_MISC_TAX:
+ X = A;
+ continue;
+ case BPF_S_MISC_TXA:
+ A = X;
+ continue;
+ case BPF_S_RET_K:
+ return K;
+ case BPF_S_RET_A:
+ return A;
+ case BPF_S_ST:
+ mem[K] = A;
+ continue;
+ case BPF_S_STX:
+ mem[K] = X;
+ continue;
+ case BPF_S_ANC_PROTOCOL:
+ case BPF_S_ANC_PKTTYPE:
+ case BPF_S_ANC_IFINDEX:
+ case BPF_S_ANC_MARK:
+ case BPF_S_ANC_QUEUE:
+ case BPF_S_ANC_HATYPE:
+ case BPF_S_ANC_RXHASH:
+ case BPF_S_ANC_CPU:
+ case BPF_S_ANC_NLATTR:
+ case BPF_S_ANC_NLATTR_NEST:
+ /* ignored */
+ continue;
+ default:
+ WARN_RATELIMIT(1, "Unknown code:%u jt:%u tf:%u k:%u\n",
+ fentry->code, fentry->jt,
+ fentry->jf, fentry->k);
+ return 0;
+ }
+ }
+
+ return 0;
+}
diff --git a/kernel/sys.c b/kernel/sys.c
index 481611f..77f2eda 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1783,6 +1783,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_SET_SECCOMP:
error = prctl_set_seccomp(arg2);
break;
+ case PR_ATTACH_SECCOMP_FILTER:
+ error = prctl_attach_seccomp_filter((char __user *)
+ arg2);
+ break;
case PR_GET_TSC:
error = GET_TSC_CTL(arg2);
break;
diff --git a/security/Kconfig b/security/Kconfig
index 51bd5a0..77b1106 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -84,6 +84,18 @@ config SECURITY_DMESG_RESTRICT
If you are unsure how to answer this question, answer N.
+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ select SECCOMP
+ depends on EXPERIMENTAL
+ help
+ This kernel feature expands CONFIG_SECCOMP to allow computing
+ in environments with reduced kernel access dictated by a system
+ call filter, expressed in BPF, installed by the application itself
+ through prctl(2).
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config SECURITY
bool "Enable different security models"
depends on SYSFS
To provide a means of reducing the kernel attack surface.
In practice, this is done at the primary kernel ABI: system calls.
Achieving this goal will address the needs expressed by many systems
projects:
qemu/kvm, openssh, vsftpd, lxc, and chromium and chromium os (me).
While system call filtering has been attempted many times, I hope that
this approach shows more promise. It works as described below and in
the patch series.
A userland task may call prctl(PR_ATTACH_SECCOMP_FILTER) to attach a
BPF program to itself. Once attached, all system calls made by the
task will be evaluated by the BPF program prior to being accepted.
Evaluation is done by executing the BPF program over the struct
user_regs_state for the process.
!! If you don't care about background or reasoning, stop reading !!
Past attempts have used:
- bitmap of system call numbers evaluated by seccomp (or tracehooks)
- standalone data structures and extra entry hooks
(cgroups syscall, systrace)
- a collection of ftrace filter strings evaluated by seccomp
- perf_event hackery to allow process termination when an event matches
(or doesn't)
In addition to the publicly posted approaches, I've personally attempted
continued deeper integration with ftrace along a number of different
lines (lead up to that can be found here[1]). What inspired the current
patch series was a number of realizations:
1. Userland knows its ABI - that's how it made the system calls in the
first place.
2. We already exposed a filtering system to userland processes in the
form of BPF and there is continued focus on optimizing evaluation
even after so many years.
3. System call filtering policies should not expose
time-of-check-time-of-use (TOCTOU) vulnerable interfaces but should
expose all the information that may be relevant to a syscall policy
decision.
The prior seccomp-ftrace implementations struggled with very
fixable challenges in ftrace: incomplete syscall coverage,
mismatched syscall names versus unistd, incomplete arch coverage,
etc. These challenges may all be fixed with some time and effort, and
potentially, even closer integration. I explored a number of
alternative approaches from making system call tracepoints per-thread
and "active" to adding a new less-perf-oriented system call.
In the process of experimentation, a number of things became clear:
- perf/ftrace system-wide analysis goals don't align with lightweight
per-thread analysis.
- ftrace/perf ABI doesn't mix well with security policy enforcement,
reduced attack surface environments, or keeping users from specifing
vulnerable filtering policies.
- other than system calls, tracepoints aren't considered ABI-stable.
The core focus of ftrace and perf is to support system-wide
performance and debugging tracing. Despite its amazing flexibility,
there are tradeoffs that are made to provide efficient system-wide
behavior that are less efficient at a per-thread level. For instance,
system call tracepoints are global. It is possible to make them
per-thread (since they use a TIF anyway). However, doing so would mean
that a system-wide system call analysis would require one trace event
per thread rather than one total. It's possible to alleviate that pain,
but that in turn requires more bookkeeping (global versus local
tracepoint registrations mapping to the thread info flag).
Another example is the ftrace ABI. Both the debugfs entry point with
unstable event ids and the perf-oriented perf_event_open(2) are not
suitable to providing a subsystem which is meant to reduce the attack
surface -- much less avoid maintainer flame wars :) The third aspect of
its ABI was also concerning and hints at yet-another-potential struggle.
The ftrace filter language happily accepts globbing and string matching.
This is excellent for tracing, but horrible for system call
interposition. If, despite warning, a user decides that blocking a
system call based on a string is what they want, they can do it. The
result is that their policy may be bypassed due to a time of check, time
of use race. While addressable, it would mean that the filtering engine
would need to allow operation filtering or offer a "secure" subset.
A side challenge that emerged from the desire to enable tracing to act
as a security policy mechanism was the ability to enact policy over more
than just the system calls. While this would be doable if all
tracepoints became active, there is a fundamental problem in that very
little, if any, tracepoints aside from system calls can be considered
stable. If a subset were to emerge as stable, there is still the
challenge of enacting security policy in parallel with tracing policy.
In an example patch where security policy logic was added to
perf_event_open(2), the basics of the system worked, but enforcement of
the security policy was simplistic and intertwined with a large number
of event attributes that were meaningless or altered the behavior.
At every turn, it appears that the tracing infrastructure was unsuited
for being used for attack surface reduction or as a larger security
subsystem on its own. It is well suited for feeding a policy
enforcement mechanism (like seccomp), but not for letting the logic
co-exist. It doesn't mean that it has security problems, just that
there will be a continued struggle between having a really good perf
system and and really good kernel attack surface reduction system if
they were merged. While there may be some distant vision where the
apparent struggle does not exist, I don't see how it would be reached.
Of course, anything is possible with unlimited time. :)
That said, much of that discussion is history and to fill in some of the
gaps since I posted the last ftrace-based patches. This patch series
should stand on its own as both straightforward and effective. In my
opinion, this is the direction I should have taken before I sent my
first patch.
I am looking forward to any and all feedback - thanks!
will
[1] http://search.gmane.org/?query=seccomp+wad%40chromium.org&group=gmane.linux.kernel
Will Drewry (3):
seccomp_filters: dynamic system call filtering using BPF programs
Documentation: prctl/seccomp_filter
Documentation/prctl/seccomp_filter.txt | 179 ++++++++
fs/exec.c | 5 +
include/linux/prctl.h | 3 +
include/linux/seccomp.h | 70 +++++-
kernel/Makefile | 1 +
kernel/fork.c | 4 +
kernel/seccomp.c | 8 +
kernel/seccomp_filter.c | 639 +++++++++++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 4 +
security/Kconfig | 12 +
9 files changed, 743 insertions(+), 3 deletions(-)
create mode 100644 kernel/seccomp_filter.c
create mode 100644 Documentation/prctl/seccomp_filter.txt
One nit:
> +Example
> +-------
> +
> +Assume a process would like to cleanly read and write to stdin/out/err and exit
> +cleanly. Without using a BPF compiler, it may be done as follows on x86 32-bit:
> +
It seems like this little program belongs in the samples/ directory.
Thanks,
jon
Thanks - I'm glad to hear it!
> One nit:
>
>> +Example
>> +-------
>> +
>> +Assume a process would like to cleanly read and write to stdin/out/err and exit
>> +cleanly. �Without using a BPF compiler, it may be done as follows on x86 32-bit:
>> +
>
> It seems like this little program belongs in the samples/ directory.
Cool - I'll do that and rev this patch.
cheers!
will
Signed-off-by: Will Drewry <w...@chromium.org>
---
Documentation/prctl/seccomp_filter.txt | 99 ++++++++++++++++++++++++++++++++
samples/Makefile | 2 +-
samples/seccomp/Makefile | 12 ++++
samples/seccomp/bpf-example.c | 74 ++++++++++++++++++++++++
4 files changed, 186 insertions(+), 1 deletions(-)
create mode 100644 Documentation/prctl/seccomp_filter.txt
create mode 100644 samples/seccomp/Makefile
create mode 100644 samples/seccomp/bpf-example.c
diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
new file mode 100644
index 0000000..15d4645
--- /dev/null
+++ b/Documentation/prctl/seccomp_filter.txt
@@ -0,0 +1,99 @@
+ Additionally, if prctl(2) is allowed by the attached filter,
+ additional filters may be layered on which will increase evaluation
+ time, but allow for further decreasing the attack surface during
+ execution of a process.
+
+The above call returns 0 on success and non-zero on error.
+
+Example
+-------
+
+samples/seccomp-bpf-example.c shows an example process that allows read from stdin,
+write to stdout/err, exit and signal returns for 32-bit x86.
+
+Caveats
+-------
+
+- execve will fail unless the most recently attached filter was installed by
+ a process with CAP_SYS_ADMIN (in its namespace).
+
+Adding architecture support
+-----------------------
+
+Any platform with seccomp support will support seccomp filters
+as long as CONFIG_SECCOMP_FILTER is enabled.
diff --git a/samples/Makefile b/samples/Makefile
index 6280817..f29b19c 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -1,4 +1,4 @@
# Makefile for Linux samples code
obj-$(CONFIG_SAMPLES) += kobject/ kprobes/ tracepoints/ trace_events/ \
- hw_breakpoint/ kfifo/ kdb/ hidraw/
+ hw_breakpoint/ kfifo/ kdb/ hidraw/ seccomp/
diff --git a/samples/seccomp/Makefile b/samples/seccomp/Makefile
new file mode 100644
index 0000000..80dc8e4
--- /dev/null
+++ b/samples/seccomp/Makefile
@@ -0,0 +1,12 @@
+# kbuild trick to avoid linker error. Can be omitted if a module is built.
+obj- := dummy.o
+
+# List of programs to build
+hostprogs-$(CONFIG_X86_32) := bpf-example
+bpf-example-objs := bpf-example.o
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
+
+HOSTCFLAGS_bpf-example.o += -I$(objtree)/usr/include -m32
+HOSTLOADLIBES_bpf-example += -m32
diff --git a/samples/seccomp/bpf-example.c b/samples/seccomp/bpf-example.c
new file mode 100644
index 0000000..f98b70a
--- /dev/null
+++ b/samples/seccomp/bpf-example.c
@@ -0,0 +1,74 @@
+/*
+ * Seccomp BPF example
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <chromiu...@chromium.org>
+ * Author: Will Drewry <w...@chromium.org>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
+ */
+
+#include <asm/unistd.h>
+#include <linux/filter.h>
+#include <stdio.h>
+#include <stddef.h>
+#include <sys/prctl.h>
+#include <sys/user.h>
+#include <unistd.h>
+
+#ifndef PR_ATTACH_SECCOMP_FILTER
+# define PR_ATTACH_SECCOMP_FILTER 36
+#endif
+
+#define regoffset(_reg) (offsetof(struct user_regs_struct, _reg))
+static int install_filter(void)
+ if (prctl(PR_ATTACH_SECCOMP_FILTER, &prog)) {
+ perror("prctl");
+ return 1;
+ }
+ return 0;
+}
+
+#define payload(_c) _c, sizeof(_c)
+int main(int argc, char **argv) {
+ char buf[4096];
+ ssize_t bytes = 0;
+ if (install_filter())
+ return 1;
+ syscall(__NR_write, STDOUT_FILENO, payload("OHAI! WHAT IS YOUR NAME? "));
+ bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf));
+ syscall(__NR_write, STDOUT_FILENO, payload("HELLO, "));
+ syscall(__NR_write, STDOUT_FILENO, buf, bytes);
+ return 0;
+}
--
1.7.5.4
Sorry!
will
Hey Will,
A few comments below, but otherwise
Acked-by: Serge Hallyn <serge....@canonical.com>
thanks,
-serge
I still think the BUG_ON here is harsh given that the progsize is passed
in by userspace. Was there a reason not to return -EINVAL here?
This comment is confusing. By 'It must' you mean that if not, it's
denied. But if I didn't know better I would read that as "we can't
get to this code unless". Can you change it to something like
"Exec is refused unless the filter was installed with CAP_SYS_ADMIN
privilege"?
That's very different approach to the system call interposition problem.
I find you solution very interesting. It gives far more capabilities
than my syscalls cgroup that you commented on some time ago. It's ready
now but I haven't tried filtering yet. I think that if your solution
make it to the mainline (and I guess that's really possible at current
stage :)), there will be no place for mine solution but that's ok.
There's one thing that I'm curious about - have you measured overhead in
any way? That was one of the biggest issues in all previous attempts to
limit syscalls. I'd love to compare the numbers with mine solution.
I'll examine your patch later on and put some comments if I bump into
something.
Best Regards,
Lukasz Sowa
Cool ;)
I didn't really read this patch yet, just one nit.
> +#define seccomp_filter_init_task(_tsk) do { \
> + (_tsk)->seccomp.filter = NULL; \
> +} while (0);
Cosmetic and subjective, but imho it would be better to add inline
functions instead of define's.
> @@ -166,6 +167,7 @@ void free_task(struct task_struct *tsk)
> free_thread_info(tsk->stack);
> rt_mutex_debug_task_free(tsk);
> ftrace_graph_exit_task(tsk);
> + seccomp_filter_free_task(tsk);
> free_task_struct(tsk);
> }
> EXPORT_SYMBOL(free_task);
> @@ -1209,6 +1211,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> /* Perform scheduler related setup. Assign this task to a CPU. */
> sched_fork(p);
>
> + seccomp_filter_init_task(p);
This doesn't look right or I missed something. something seccomp_filter_init_task()
should be called right after dup_task_struct(), at least before copy process can
fail.
Otherwise copy_process()->free_fork()->seccomp_filter_free_task() can put
current->seccomp.filter copied by arch_dup_task_struct().
> +struct seccomp_filter {
> + struct kref usage;
> + struct pid *creator;
Why? seccomp_filter->creator is never used, no?
Oleg.
> Filter programs may _only_ cross the execve(2) barrier if last filter
> program was attached by a task with CAP_SYS_ADMIN capabilities in its
> user namespace. Once a task-local filter program is attached from a
> process without privileges, execve will fail. This ensures that only
> privileged parent task can affect its privileged children (e.g., setuid
> binary).
This means that a non privileged user can not run another program with
limited features? How would a process exec another program and filter
it? I would assume that the filter would need to be attached first and
then the execv() would be performed. But after the filter is attached,
the execv is prevented?
Maybe I don't understand this correctly.
-- Steve
Time to resurrect execve_nosecurity? If so, then the rule could be
simplified to: seccomp programs cannot use normal execve at all.
The longer I linger on lists and see neat ideas like this, the more I
get annoyed that execve is magical. I dream of a distribution that
doesn't use setuid, file capabilities, selinux transitions on exec, or
any other privilege changes on exec *at all*. I think that the only
things missing in the kernel (other than something intelligent to do
about SELinux) are execve_nosecurity and the ability for a normal
program to wait for an unrelated program to finish (or some other way
that a program can ask a daemon to spawn a privileged program for it
and then to cleanly wait for that program to finish in a way that
could survive re-exec of the daemon).
--Andy
I think this model is wrong. The rest of the policy rules all work on the
basis that dumpable is the decider (the same rules for not dumping, not
tracing, etc). A user should be able to apply filter to their own code
arbitarily. Any setuid app should IMHO lose the trace subject to the usual
uid rules and capability rules. That would seem to be more flexible and
also the path of least surprise.
[plus you can implement non setuid exec entirely in userspace so it's
a rather meaningless distinction you propose]
> be tackled separately via separate patchsets. (And at some point sharing
> BPF JIT code!)
A BPF jit ought to be trivial and would be a big win.
In general I like this approach. It's simple, it's compact and it offers
interesting possibilities for solving some interesting problem spaces,
without the full weight of SELinux, SMACK etc which are still needed for
heavyweight security.
Alan
May be this needs something like LSM_UNSAFE_SECCOMP, or perhaps
cap_bprm_set_creds() should take seccomp.mode == 2 into account, I dunno.
OTOH, currently seccomp.mode == 1 doesn't allow to exec at all.
Oleg.
> The longer I linger on lists and see neat ideas like this, the more I
> get annoyed that execve is magical. I dream of a distribution that
> doesn't use setuid, file capabilities, selinux transitions on exec, or
> any other privilege changes on exec *at all*.
Is that the fear with filtering on execv? That if we have filters on an
execv calling a setuid program that we change the behavior of that
privileged program and might cause unexpected results?
In that case, just have execv fail if filtering is enabled and we are
execing a setuid program. But I don't see why non "magical" execv's
should be prohibited.
Stupid question. I am sure you know what are you doing ;) and I know
nothing about !x86 arches.
But could you explain why it is designed to use user_regs_struct ?
Why we can't simply use task_pt_regs() and avoid the (costly) regsets?
Just curious.
Oleg.
> May be this needs something like LSM_UNSAFE_SECCOMP, or perhaps
> cap_bprm_set_creds() should take seccomp.mode == 2 into account, I dunno.
>
> OTOH, currently seccomp.mode == 1 doesn't allow to exec at all.
I've never used seccomp, so I admit I'm totally ignorant on this topic.
But looking at seccomp from the outside, the biggest advantage to this
would be the ability for normal processes to be able to limit tasks it
kicks off. If I want to run a task in a sandbox, I don't want to be root
to do so.
I guess a web browser doesn't perform an exec to run java programs. But
it would be nice if I could execute something from the command line that
I could run in a sand box.
What's the problem with making sure that the setuid isn't set before
doing an execv? Only fail when setuid (or some other magic) is enabled
on the file being exec'd.
Or is this a race where I can have a soft link pointing to a normal
file, run this, and have the link change to a setuid file at just the
right time that causes it to happen?
-- Steve
Exactly.
>
> In that case, just have execv fail if filtering is enabled and we are
> execing a setuid program. But I don't see why non "magical" execv's
> should be prohibited.
>
How do you define "non-magical"?
If setuid is set, then it's obviously magical. On a nosuid
filesystem, strange things happen. If file capabilities are enabled
and set, then different magic happens. With LSMs involved, anything
can be magical. (SELinux AFAICT looks up rules on every single exec,
so it might be impossible for execve to be non-magical.) If execve is
banned entirely when seccomp is enabled, then there will never be any
attacks based on abusing these mechanisms.
My proposal is to have an alternative mechanism that, from a security
POV, does nothing that the caller couldn't have done on its own. The
only reason it would be needed at all is because implementing execve
with correct semantics from userspace is a PITA -- the right set of
fds needs to be closed, threads need to be killed (without races),
vmas need to be found an unmapped, a new program needs to be mapped in
(possibly at the same place that the old one was mapped at),
/proc/self/exe needs to be updated, auxv needs to be recreated
(including using values that glibc might have erased already), etc.
The code is short and it works (although I have no idea whether it
applies to current kernels).
Oleg: my only issue with setting something like LSM_UNSAFE_SECCOMP is
that a different class of vulnerability might be introduced: take a
setuid program that calls other setuid programs (or just uses execve
as a way to get the default execve capability handling, SELinux
handling, etc), run it (as root!) inside seccomp, and watch it
possibly develop security holes. If the alternate execve is a
different syscall, then this can't happen. And if someone remaps
execve to execve_nosecurity (from userspace or via some in-kernel
mechanism) and causes problems, it's entirely clear who to blame.
--Andy
Thanks! Unimportant responses below. Fixes will be incorporated in
the next round (along with Oleg's feedback).
cheers,
will
I've changed it in the next revision. As is, I don't believe
userspace can control
the size of padding directly, just the increment since it specifies
its length in terms
of bpf blocks (sizeof(struct sock_filter)). But EINVAL is certainly
less aggressive :)
Sounds good!
me too ;)
> But looking at seccomp from the outside, the biggest advantage to this
> would be the ability for normal processes to be able to limit tasks it
> kicks off. If I want to run a task in a sandbox, I don't want to be root
> to do so.
>
> I guess a web browser doesn't perform an exec to run java programs. But
> it would be nice if I could execute something from the command line that
> I could run in a sand box.
>
> What's the problem with making sure that the setuid isn't set before
> doing an execv? Only fail when setuid (or some other magic) is enabled
> on the file being exec'd.
I agree. That is why I mentioned LSM_UNSAFE_SECCOMP/cap_bprm_set_creds.
Just I do not know what would be the most simple/clean way to do this.
And in any case I agree that the current seccomp_check_exec() looks
strange. Btw, it does
{
if (current->seccomp.mode != 2)
return 0;
/* We can rely on the task refcount for the filter. */
if (!current->seccomp.filter)
return -EPERM;
How it is possible to have seccomp.filter == NULL with mode == 2?
Oleg.
Refactoring it a bit to make that possible. Since seccomp fork/init/free
never needs access to the whole task_structs, I'll just pass in what's
needed (and avoid the sched.h inclusion recursion).
Comments on the next round will most definitely be appreciated!
>> @@ -166,6 +167,7 @@ void free_task(struct task_struct *tsk)
>> � � � free_thread_info(tsk->stack);
>> � � � rt_mutex_debug_task_free(tsk);
>> � � � ftrace_graph_exit_task(tsk);
>> + � � seccomp_filter_free_task(tsk);
>> � � � free_task_struct(tsk);
>> �}
>> �EXPORT_SYMBOL(free_task);
>> @@ -1209,6 +1211,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>> � � � /* Perform scheduler related setup. Assign this task to a CPU. */
>> � � � sched_fork(p);
>>
>> + � � seccomp_filter_init_task(p);
>
> This doesn't look right or I missed something. something seccomp_filter_init_task()
> should be called right after dup_task_struct(), at least before copy process can
> fail.
>
> Otherwise copy_process()->free_fork()->seccomp_filter_free_task() can put
> current->seccomp.filter copied by arch_dup_task_struct().
Ah - makes sense! I moved it under dup_task_struct before any goto's
to bad_fork_free.
>> +struct seccomp_filter {
>> + � � struct kref usage;
>> + � � struct pid *creator;
>
> Why? seccomp_filter->creator is never used, no?
Removing it. It is from a related patch I'm experimenting with (adding
optional tracehook support), but it has no bearing here.
Thanks - new patch revision incoming!
will
Yeah - it means tasks can filter themselves, but not each other.
However, you can inject a filter for any dynamically linked executable
using LD_PRELOAD.
> Maybe I don't understand this correctly.
You're right on. This was to ensure that one process didn't cause
crazy behavior in another. I think Alan has a better proposal than
mine below. (Goes back to catching up.)
There's some seccomp-related code in the vsyscall emulation path in
arch/x86/kernel/vsyscall_64.c. How should time(), getcpu(), and
gettimeofday() be handled? If you want filtering to work, there
aren't any real syscall registers to inspect, but they could be
synthesized.
Preventing a malicious task from figuring out approximately what time
it is is basically impossible because of the way that vvars work. I
don't know how to change that efficiently.
--Andy
My line of thinking up to now has been that disallowing setuid exec
would mean there is no risk of an errant setuid binary allowing escape
from the system call filters (which the containers people may care
more about). Since setuid is privilege escalation, then perhaps it
makes sense to allow it as an escape hatch.
Would it be sane to just disallow setuid exec exclusively?
> [plus you can implement non setuid exec entirely in userspace so it's
> a rather meaningless distinction you propose]
Agreed.
>> be tackled separately via separate patchsets. (And at some point sharing
>> BPF JIT code!)
>
> A BPF jit ought to be trivial and would be a big win.
>
> In general I like this approach. It's simple, it's compact and it offers
> interesting possibilities for solving some interesting problem spaces,
> without the full weight of SELinux, SMACK etc which are still needed for
> heavyweight security.
>
Thanks! Yeah I think merging with the network stack is eminently
doable, but I didn't want to bog down the proposal in how much
overhead I might be adding to the network layer.
It shouldn't be. It's another relic I missed from development. (Adding to v3 :)
So on x86 32, it would work since user_regs_struct == task_pt_regs
(iirc), but on x86-64
and others, that's not true. I don't think it's kosher to expose
pt_regs to the userspace, but if, let's say, x86-32 overrides the weak
linkage, then it could just return task_pt_regs and be the fastest
path.
If it would be appropriate to expose pt_regs to userspace, then I'd
happily do so :)
The whole "fail security escalations" thing goes way beyond just
filtering, I think we could seriously try to make it a generic
feature.
For example, somebody just asked me the other day why "chroot()"
requires admin privileges, since it would be good to limit even
non-root things.
And it's really the exact same issue as filtering: in some sense,
chroot() "filters" FS name lookups, and can be used to fool programs
that are written to be secure.
We could easily introduce a per-process flag that just says "cannot
escalate privileges". Which basically just disables execve() of
suid/sgid programs (and possibly other things too), and locks the
process to the current privileges. And then make the rule be that *if*
that flag is set, you can then filter across an execve, or chroot as a
normal user, or whatever.
There are probably other things like that - things like allowing users
to do bind mounts etc - that aren't dangerous in themselves, but that
are dangerous mainly because they can be used to fool things into
privilege escalations. So this is definitely not a filter-only issue.
Linus
I think that is a policy question. I can imagine cases where either
behaviour is the "right" one so it may need to be a parameter ?
Alan
> The whole "fail security escalations" thing goes way beyond just
> filtering, I think we could seriously try to make it a generic
> feature.
After I wrote this comment I thought the same thing. It would be nice to
have a way to just set a flag to a process that will prevent it from
doing any escalating of privileges.
I totally agree, this would solve a whole host of issues with regard to
security issues in things that shouldn't be a problem but currently are.
-- Steve
Please tell some of us what "BPF" means. wikipedia lists 15 possible
choices, but I don't know which one to choose.
> Signed-off-by: Will Drewry <w...@chromium.org>
> ---
> Documentation/prctl/seccomp_filter.txt | 99 ++++++++++++++++++++++++++++++++
> samples/Makefile | 2 +-
> samples/seccomp/Makefile | 12 ++++
> samples/seccomp/bpf-example.c | 74 ++++++++++++++++++++++++
> 4 files changed, 186 insertions(+), 1 deletions(-)
> create mode 100644 Documentation/prctl/seccomp_filter.txt
> create mode 100644 samples/seccomp/Makefile
> create mode 100644 samples/seccomp/bpf-example.c
--
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
I'll make it clearer in the documentation file and update the patch description.
BPF == Berkeley Packet Filters which are implemented in Linux Socket
Filters (LSF)>
thanks!
Yeah - there've been so many tries, I'll be happy when one makes it in
which is usable :)
> There's one thing that I'm curious about - have you measured overhead in
> any way? That was one of the biggest issues in all previous attempts to
> limit syscalls. I'd love to compare the numbers with mine solution.
Certainly. I have some rough numbers, but nothing I'd call strong
measurements. There is still a fair amount of cost due to the syscall
slow path.
> I'll examine your patch later on and put some comments if I bump into
> something.
Much appreciated - cheers!
will
Yes sure, I meant that userpace should use pt_regs too.
> If it would be appropriate to expose pt_regs to userspace, then I'd
> happily do so :)
Ah, so that was the reason. But it is already exported? At least I see
the "#ifndef __KERNEL__" definition in arch/x86/include/asm/ptrace.h.
Once again, I am not arguing, just trying to understand. And I do not
know if this definition is part of abi.
Oleg.
Actually it does. Firefox on Linux forks and execs the Java VM.
Same for Flash, using "plugin-container".
> But it would be nice if I could execute something from the command
> line that I could run in a sand box.
You can do this now, using ptrace(). It's horrible, but half of the
horribleness is needing to understand machine-dependent registers,
which this new patch doesn't address. (The other half is a ton of
undocumented but important ptrace() behaviours on Linux.)
-- Jamie
> > Please tell some of us what "BPF" means. wikipedia lists 15 possible
> > choices, but I don't know which one to choose.
>
> I'll make it clearer in the documentation file and update the patch description.
>
> BPF == Berkeley Packet Filters which are implemented in Linux Socket
> Filters (LSF)>
>
I admit, I was totally clueless in what it meant too ;)
Even the LWN article didn't explain (shame on you Jon).
"he has repurposed the networking layer's packet filtering mechanism
(BPF)"
I didn't know what did the "B" stood for.
-- Steve
That's not entirely true. ptrace supervisors are subject to races and
always fail open. This makes them effective but not as robust as a
seccomp solution can provide.
With seccomp, it fails close. What I think would make sense would be
to add a user-controllable failure mode with seccomp bpf that calls
tracehook_ptrace_syscall_entry(regs). I've prototyped this and it
works quite well, but I didn't want to conflate the discussions.
Using ptrace() would also mean that all consumers of this interface
would need a supervisor, but with seccomp, the filters are installed
and require no supervisors to stick around for when failure occurs.
Does that make sense?
thanks!
will
Ugly method: Using ptrace(), trap after the execve() and issue fake
syscalls to install the filter. I feel dirty thinking it, in a good way.
LD_PRELOAD has been suggested. It's not 100% reliable because not all
executables are dynamic (on some uClinux platforms none of them are),
but it will usually work.
-- Jamie
> You can do this now, using ptrace(). It's horrible, but half of the
> horribleness is needing to understand machine-dependent registers,
> which this new patch doesn't address. (The other half is a ton of
> undocumented but important ptrace() behaviours on Linux.)
Yeah I know the horrid use of ptrace, I've implemented programs that use
it :-p
I guess ptrace can capture the execv and determine if it is OK or not to
run it. But again, this doesn't stop the possible attacks that could
happen, with having the execv on a symlink file, having the ptrace check
say its OK, and then switching the symlink to a setuid file.
When the new execv executed, the parent process would lose all control
over it. The idea is to prevent this.
I like Alan's suggestion. Have userspace decide to allow execv or not,
and even let it decide if it should allow setuid execv's or not, but
still allow non-setuid execvs. If you allow the setuid execv, once that
happens, the same behavior will occur as with ptrace. A setuid execv
will lose all its filtering.
-- Steve
That warm fuzzy feeling :-)
> I guess ptrace can capture the execv and determine if it is OK or not to
> run it. But again, this doesn't stop the possible attacks that could
> happen, with having the execv on a symlink file, having the ptrace check
> say its OK, and then switching the symlink to a setuid file.
>
> When the new execv executed, the parent process would lose all control
> over it. The idea is to prevent this.
fexecve() exists to solve the problem.
Also known as execve("/proc/self/fd/...") on Linux.
> I like Alan's suggestion. Have userspace decide to allow execv or not,
> and even let it decide if it should allow setuid execv's or not, but
> still allow non-setuid execvs. If you allow the setuid execv, once that
> happens, the same behavior will occur as with ptrace. A setuid execv
> will lose all its filtering.
I like the idea of letting the tracer decide what it wants.
-- Jamie
I don't either :/ My original idea was to operate on task_pt_regs(current),
but I noticed that PTRACE_GETREGS/SETREGS only uses the
user_regs_struct. So I went that route.
I'd love for pt_regs to be fair game to cut down on the copying!
will
Makes sense. I'll make it flaggable (ignoring the parallel conversation
about having a thread-wide suidable bit).
thanks!
> > I like Alan's suggestion. Have userspace decide to allow execv or not,
> > and even let it decide if it should allow setuid execv's or not, but
> > still allow non-setuid execvs. If you allow the setuid execv, once that
> > happens, the same behavior will occur as with ptrace. A setuid execv
> > will lose all its filtering.
>
> I like the idea of letting the tracer decide what it wants.
Right, and if we implement the suggestion that Linus made, to set a flag
to prevent a task from every getting privilege, then seccomp can add
that too.
That is, there can be a filter to say "prevent this task from doing
anything with privilege" and that will prevent execv from gaining setuid
privilege. Perhaps, it would still do the execv, but the program that is
executed will run as the normal user, and just fail when it tries to do
something that requires sys admin privilege.
Thus, execv will not be a "special" case here. Seccomp either allows it
or not. But also add a command to tell seccomp that this task will not
be allowed to do anything privileged.
-- Steve
What races do you know about?
I'm not aware of any ptrace races if it's used properly. I'm also not
sure what you mean by fail open/close here, unless you mean the target
process gets to carry on if the tracing process dies.
Having said that, I can think of one race, but I think your BPF scheme
has the same one: After checking the syscall's string arguments and
other pointed to data, another thread can change those arguments
before the real syscall uses them.
> With seccomp, it fails close. What I think would make sense would be
> to add a user-controllable failure mode with seccomp bpf that calls
> tracehook_ptrace_syscall_entry(regs). I've prototyped this and it
> works quite well, but I didn't want to conflate the discussions.
It think it's a nice idea. While you're at it could you fix all the
architectures to actually use tracehooks for syscall tracing ;-)
(I think it's ok to call the tracehook function on all archs though.)
> Using ptrace() would also mean that all consumers of this interface
> would need a supervisor, but with seccomp, the filters are installed
> and require no supervisors to stick around for when failure occurs.
>
> Does that make sense?
It does, I agree that ptrace() is quite cumbersome and you don't
always want a separate tracing process, especially if "failure" means
to die or get an error.
On the other hand, sometimes when a failure occurs, having another
process decide what to do, or log the event, is exactly what you want.
For my nefarious purposes I'm really just looking for a faster way to
reliably trace some activities of individual processes, in particular
tracking which files they access. I'd rather not interfere with
debuggers, so I'd really like your ability to stack multiple filters
to work with separate-process tracing as well. And I'd happily use a
filter rule which can dump some information over a pipe, without
waiting for the tracer to respond in most cases.
-- Jamie
You can already use ptrace() to cause crazy behaviour in another
process, including modifying registers arbitrarily at syscall entry
and exit, aborting and emulating syscalls.
ptrace() is quite slow and it would be really nice to speed it up,
especially for trapping a small subset of syscalls, or limiting some
kinds of access to some file descriptors, while everything else runs
at normal speed.
Speeding up ptrace() with BPF filters would be a really nice. Not
that I like ptrace(), but sometimes it's the only thing you can rely on.
LD_PRELOAD and code running in the target process address space can't
always be trusted in some contexts (e.g. the target process may modify
the tracing code or its data); whereas ptrace() is pretty complete and
reliable, if ugly.
There's already a security model around who can use ptrace(); speeding
it up needn't break that.
If we'd had BPF ptrace in the first place, SECCOMP wouldn't have been
needed as userspace could have done it, with exactly the restrictions
it wants. Google's NaCl comes to mind as a potential user.
-- Jamie
I'm pretty sure that if you have two "isolated" processes, they could
cause irregular behavior using signals.
> I'm not aware of any ptrace races if it's used properly. �I'm also not
> sure what you mean by fail open/close here, unless you mean the target
> process gets to carry on if the tracing process dies.
Exactly. Security systems that, on failure, allow the action to
proceed can't be relied on.
> Having said that, I can think of one race, but I think your BPF scheme
> has the same one: After checking the syscall's string arguments and
> other pointed to data, another thread can change those arguments
> before the real syscall uses them.
Not a problem - BPF only allows register inspection. No TOCTOU attacks
need apply :D
>> With seccomp, it fails close. �What I think would make sense would be
>> to add a user-controllable failure mode with seccomp bpf that calls
>> tracehook_ptrace_syscall_entry(regs). �I've prototyped this and it
>> works quite well, but I didn't want to conflate the discussions.
>
> It think it's a nice idea. �While you're at it could you fix all the
> architectures to actually use tracehooks for syscall tracing ;-)
>
> (I think it's ok to call the tracehook function on all archs though.)
>
>> Using ptrace() would also mean that all consumers of this interface
>> would need a supervisor, but with seccomp, the filters are installed
>> and require no supervisors to stick around for when failure occurs.
>>
>> Does that make sense?
>
> It does, I agree that ptrace() is quite cumbersome and you don't
> always want a separate tracing process, especially if "failure" means
> to die or get an error.
>
> On the other hand, sometimes when a failure occurs, having another
> process decide what to do, or log the event, is exactly what you want.
>
> For my nefarious purposes I'm really just looking for a faster way to
> reliably trace some activities of individual processes, in particular
> tracking which files they access. �I'd rather not interfere with
> debuggers, so I'd really like your ability to stack multiple filters
> to work with separate-process tracing as well. �And I'd happily use a
> filter rule which can dump some information over a pipe, without
> waiting for the tracer to respond in most cases.
Cool - if the rest of this discussion proceeds, then hopefully, we can
move towards discussing if tying it with ptrace is a good idea or a
horrible one :)
thanks!
Like this?
http://lkml.indiana.edu/hypermail/linux/kernel/1003.3/01225.html
(This depends on execve_nosecurity, which is controversial, but that
dependency would be trivial to remove.)
Note that there's a huge can of worms if execve is allowed but
suid/sgid is not: selinux may elevate privileges on exec of pretty
much anything. (I think that this is a really awful idea, but it's in
the kernel, so we're stuck with it.)
--Andy
I don't know the execve_nosecurity patches, so the diff makes little
sense to me, but yeah, I wouldn't expect it to be more than a couple
of lines. Exactly *how* you set the bit etc is not something I care
deeply about, prctl seems about as good as anything.
> Note that there's a huge can of worms if execve is allowed but
> suid/sgid is not: selinux may elevate privileges on exec of pretty
> much anything. �(I think that this is a really awful idea, but it's in
> the kernel, so we're stuck with it.)
You can do any amount of crazy things with selinux, but the other side
of the coin is that it would also be trivial to teach selinux about
this same "restricted environment" bit, and just say that a process
with that bit set doesn't get to match whatever selinux privilege
escalation rules..
I really don't think this is just about "execve cannot do setuid". I
think it's about the process being marked as restricted.
So in your patch, I think that "PR_RESTRICT_EXEC" bit is wrong. It
should simply be "PR_RESTRICT_ME", and be done with it, and not try to
artificially limit it to be some "execve feature", and more think of
it as a "this is a process that has *no* extra privileges at all, and
can never get them".
Linus
Fair enough. I'll submit the simpler patch tonight.
execve_nosecurity was my attempt to sidestep selinux issues. It's a
different syscall that does all of the non-security-related things
that execve does but does not escalate (or even change) any
privileges. Maybe I'll try to rework that for newer kernels as well.
The idea is that programs that expect to run in sandboxes / chroots /
namespaces / whatever can use it, and older programs that might
malfunction dangerously if the semantics of execve change will just
fail instead.
--Andy
I don't see any issues with SELinux support for this feature.
Specifically, when you try to execute something in SELinux, it will
first look at the types and try to "execute" (involving a type
transition IE: security label change).
But if that fails in many cases it may still be allowed to
"execute_no_trans" (IE: regular non-privileged exec() without a
transition).
If you add this feature, it should just disable the normal "execute"
with transition path and unconditionally fall back to
"execute_no_trans".
Likewise, enabling these bits should also disable the "transition" and
"dyntransition" process access vectors, and I'm on the fence about
whether "setfscreate", etc should be allowed.
Cheers,
Kyle Moffett
--
Curious about my work on the Debian powerpcspe port?
I'm keeping a blog here: http://pureperl.blogspot.com/
This sounds cool. Do you think you'll go for a new task_struct member
or will it a securebit? (Seems like securebits might be too tied to
posix file caps, but I figured I'd ask).
I'm planning on going ahead and mocking up your potential patch so I
can respin this series using it and make sure I understand the
interactions.
thanks!
will
Or cred member, etc.
I think securebits and cred didn't exist the first time I did this,
and sticking it in struct cred might unnecessarily prevent sharing
cred (assuming that even happens). So I'd say task_struct.
--Andy
I think it almost has to be task state, since we very much want to
make sure it's trivial to see that nothing ever clears that bit, and
that it always gets copied right over a fork/exec/whatever.
Putting it in some cred or capability bit or somethin would make that
kind of transparency pretty much totally impossible.
Linus
In the ptrace case, doesn't it just downgrade the privileges of the new process
if there is a tracer, rather than detach the tracer?
Ignoring that, I've been looking at system call filters as being equivalent to
something like the caps bounding set. Once reduced, there's no going
back. I think Linus's proposal perfectly resolves the policy decision around
suid execution behavior in the run-with-privs or not scenarios (just like with
how ptrace does it). However, I'd like to avoid allowing any process to
escape system call filters once installed. (It's doable to add
suid/caps-based-bypass, but it certainly not ideal from my perspective.)
cheers,
will
I agree.
In principle, it could be safe for an outside (non-seccomp) process
with appropriate credentials to lift seccomp restrictions from a
different process. But why?
--Andy
> >> You can do any amount of crazy things with selinux, but the other side
> >> of the coin is that it would also be trivial to teach selinux about
> >> this same "restricted environment" bit, and just say that a process
> >> with that bit set doesn't get to match whatever selinux privilege
> >> escalation rules..
> I don't see any issues with SELinux support for this feature.
>
> Specifically, when you try to execute something in SELinux, it will
> first look at the types and try to "execute" (involving a type
> transition IE: security label change).
>
> But if that fails in many cases it may still be allowed to
> "execute_no_trans" (IE: regular non-privileged exec() without a
> transition).
That's not true. See specifically
security/selinux/hooks.c::selinux_bprm_set_creds() We calculate a label
for the new task (that may or may not be the same) and then check if
there is permission to run the new binary with the new label. There is
no fallback.
The exception would be if the binary is on a MNT_NOSUID mount point, in
which case we calculate the new label, then just revert to the same
label.
At first glance it looks to me like a reasonable way to implement this
at first would be to do the new checks right next to any place we
already do MNT_NOSUID checks and mimic their behavior. If there are
other priv escalation points in the kernel we might need to consider if
MNT_NOSUID is adequate....
-Eric
A setuid binary is not necessarily priviledged - indeed a root -> user
transition via setuid is pretty much the reverse.
It's a change of user context. Things like ptrace and file permissions
basically mean you can't build a barrier between stuff running as the
same uid to a great extent except with heavy restricting, but saying
"you can't become someone else" is very useful.
Alan
I don't really like the current logic. It does:
if (old_tsec->exec_sid) {
new_tsec->sid = old_tsec->exec_sid;
/* Reset exec SID on execve. */
new_tsec->exec_sid = 0;
} else {
/* Check for a default transition on this program. */
rc = security_transition_sid(old_tsec->sid, isec->sid,
SECCLASS_PROCESS, NULL,
&new_tsec->sid);
if (rc)
return rc;
}
COMMON_AUDIT_DATA_INIT(&ad, PATH);
ad.u.path = bprm->file->f_path;
if (bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)
new_tsec->sid = old_tsec->sid;
which means that, if MNT_NOSUD, then exec_sid is silently ignored.
I'd rather fail in that case, but it's probably too late for that.
However, if we set the "no new privileges" flag, then we could fail,
since there's no old ABI to be compatible with. I'll implement it
that way.
--Andy
Not just "someone else".
The guarantee basically has to be "you can't change your security
context". Where "become somebody else" is part of it, but any
capability changes etc would be part of it too. So it should disable
all games with capabilities etc.
And I don't think selinux really should be all that much of a problem
- we should just make sure that selinux would honor such a bit, and
refuse to do any op that would change any selinux capabilities either.
Same goes for other security models.
And that may include restricting the ways a binary can be executed
totally outside of suid/sgid bits. For example, if you consider
binaries under /home to have different selinxu rules than system
binaries in /usr/bin, then a cross-execute from one to the other may
not work, regardless of whether it's suid or not.
I think that is the kind of guarantee a sandbox environment really
wants: "I'm setting up a sandbox, you'd better not change the
permissions on me regardless of what crazy things I do".
Linus
> In general I like this approach. It's simple, it's compact and it offers
> interesting possibilities for solving some interesting problem spaces,
> without the full weight of SELinux, SMACK etc which are still needed for
> heavyweight security.
Yes, I can see potential to vastly simplify MAC policy in some cases.
- James
--
James Morris
<jmo...@namei.org>
Do you have an example? I'm not aware of one and I've been studying
ptrace quite a bit lately. If there's a race (other than temporary
kernel bugs with all the ptrace patching lately ;-), I would like to
know and maybe patch it.
The only signal confusion when ptracing syscalls I'm aware of is with
SIGTRAP, and that was fixed in 2.5.46, long, long ago (PTRACE_SETOPTIONS).
> > I'm not aware of any ptrace races if it's used properly. �I'm also not
> > sure what you mean by fail open/close here, unless you mean the target
> > process gets to carry on if the tracing process dies.
>
> Exactly. Security systems that, on failure, allow the action to
> proceed can't be relied on.
That's fair enough. There are numerous occasions when ptracer death
should kill the tracee anyway regardless of security. E.g. "strace
command..." and strace dies, you'd normally want the command to
be killed as well. So that could be worth a ptrace option anyway.
Thanks,
-- Jamie
I think execve should be allowed and follow the same rules as execve under ptrace.
That one could be easily fixed with a new ptrace option.
The tracer can kill all traced tasks before it dies except when it exits
with a SIGKILL. In that case another observer task could kill all the
traced tasks, but that is just moving the problem around.
> Having said that, I can think of one race, but I think your BPF scheme
> has the same one: After checking the syscall's string arguments and
> other pointed to data, another thread can change those arguments
> before the real syscall uses them.
I have implemented a ptrace based jailer which avoids these kinds of
races by copying such strings to read-only memory before the system call
is allowed to proceed. Only races that can't be closed with ptrace are
symlink races, and then only with an attacker outside the jail.
And the architectural differences in registers are easily abstracted away
when you're only interested in system call arguments and the instruction
pointer. The system call table information is more annoying, but unavoidable.
Our jailer is around 5k lines of code and supports checking file paths, PIDs,
FDs, SYSV IPC and has limited networking support (no incoming peer address
filtering), all race free. The idea is transparent jailing of complex tasks
with minimal configuration (everything is contained within the jail, access
to anything else needs explicit permission). It's more or less finished for
a few years now, but everyone is busy with other things and no one got around
releasing the code. :-/
It would be nice to avoid the ptrace overhead for system calls that are always
allowed or always denied, so I hope this BPF filtering can be made to work in
conjunction with ptrace so that the tracer only has to handle system calls not
handled by the BPF filter. One way to achieve that is to have a way for the BPF
filter to let a system call generate ptrace system call events or not, with a
new ptrace option PTRACE_UNHANDLED_SYSCALL or something like that to ask for
the unhandled system calls events.
Greetings,
Indan
Yeah, that's one and it's a pretty awful one when you can consider
that the untrusted tracee can play games such as trying to get the
kernel to fire OOM SIGKILLs.
My memory is hazy but the last time I looked at this in detail there
were other racy areas:
- Bad problems if the tracee takes a SIGTSTP or (real) SIGCONT.
- Difficulty in stopping the syscall from executing once it has
started, especially if the tracer dies.
Cheers
Chris
Well, I don't know where user_regs_struct come from initially. But
probably it is needed to allow to access the "artificial" things like
fs_base. Or perhaps this struct mimics the layout in the coredump.
> I'd love for pt_regs to be fair game to cut down on the copying!
Me too. I see no point in using user_regs_struct.
Oleg.
Not sure - added Roland whose name was on many of the files :)
I just noticed that ptrace ABI allows pt_regs access using the register
macros (PTRACE_PEEKUSR) and user_regs_struct access (PTRACE_GETREGS).
But I think the latter is guaranteed to have a certain layout while the macros
for PEEKUSR can do post-processing fixup. (Which could be done in the
bpf evaluator load_pointer() helper if needed.)
>> I'd love for pt_regs to be fair game to cut down on the copying!
>
> Me too. I see no point in using user_regs_struct.
I'll rev the change to use pt_regs and drop all the helper code. If
no one says otherwise, that certainly seems ideal from a performance
perspective, and I see pt_regs exported to userland along with ptrace
abi register offset macros.
Thanks!
will
On second thought, pt_regs is scary :)
From looking at
http://lxr.linux.no/linux+v3.2.1/arch/x86/include/asm/syscall.h#L97
and ia32syscall enty code, it appears that for x86, at least, the
pt_regs for compat processes will be 8 bytes wide per register on the
stack. This means if a self-filtering 32-bit program runs on a 64-bit host in
IA32_EMU, its filters will always index into pt_regs incorrectly.
I'm not 100% that I am reading the code right, but it means that I can either
keep using user_regs_struct or fork the code behavior based on compat. That
would need to be arch dependent then which is pretty rough.
Any thoughts?
I'll do a v5 rev for Eric's comments soon, but I'm not quite sure
about the pt_regs
change yet. If the performance boost is worth the effort of having a
per-arch fixup,
I can go that route. Otherwise, I could look at some alternate approach for a
faster-than-regview payload.
Thanks!
Ugh. Sorry about the formatting. (The other option is to disallow compat ;).
cheers!
will
The rules are of the form:
label object:
value label
So using Will's BPF example code in my syntax looks like:
start syscall:
rt_sigreturn success
sigreturn success
exit_group success
exit success
read read
write write
read arg0:
0 success
write arg0:
1 success
2 success
So this says the first label is "start" and it is going to deal with the
syscall number. The first value is 'rt_sigreturn' and if syscall ==
rt_sigreturn will cause you to jump to 'success' (success and fail are
implied labels). If the syscall is 'write' we will jump to 'write.'
The write rules look at arg0. If arg0 == "1" we jump to "success". If
you run that syntax through my translator you should get Will's BPF
rules!
You'll quickly notice that the translator only understands "syscall" and
"arg0" and only x86_32, but it should be easy to add more, support the
right registers on different arches, etc, etc. If others think they
might want to hack on the translator I put it at:
http://git.infradead.org/users/eparis/bpf-translate.git
-Eric
(I hoped everyone is CC'ed, emails forwarded via lkml truncate the CC list
when it's too long.)
On Sat, January 14, 2012 00:10, Will Drewry wrote:
> On Fri, Jan 13, 2012 at 1:01 PM, Will Drewry <w...@chromium.org> wrote:
>> On Fri, Jan 13, 2012 at 11:31 AM, Oleg Nesterov <ol...@redhat.com> wrote:
>>> On 01/12, Will Drewry wrote:
>>>>
>>>> On Thu, Jan 12, 2012 at 11:23 AM, Oleg Nesterov <ol...@redhat.com> wrote:
>>>> > On 01/12, Will Drewry wrote:
>>>> >>
>>>> >> On Thu, Jan 12, 2012 at 10:22 AM, Oleg Nesterov <ol...@redhat.com> wrote:
>>>> >> >> + � � �*/
>>>> >> >> + � � regs = seccomp_get_regs(regs_tmp, ®s_size);
>>>> >> >
>>>> >> > Stupid question. I am sure you know what are you doing ;) and I know
>>>> >> > nothing about !x86 arches.
>>>> >> >
>>>> >> > But could you explain why it is designed to use user_regs_struct ?
>>>> >> > Why we can't simply use task_pt_regs() and avoid the (costly) regsets?
>>>> >>
>>>> >> So on x86 32, it would work since user_regs_struct == task_pt_regs
>>>> >> (iirc), but on x86-64
>>>> >> and others, that's not true.
>>>> >
>>>> > Yes sure, I meant that userpace should use pt_regs too.
>>>> >
>>>> >> If it would be appropriate to expose pt_regs to userspace, then I'd
>>>> >> happily do so :)
>>>> >
>>>> > Ah, so that was the reason. But it is already exported? At least I see
>>>> > the "#ifndef __KERNEL__" definition in arch/x86/include/asm/ptrace.h.
>>>> >
>>>> > Once again, I am not arguing, just trying to understand. And I do not
>>>> > know if this definition is part of abi.
>>>>
>>>> I don't either :/ �My original idea was to operate on task_pt_regs(current),
>>>> but I noticed that PTRACE_GETREGS/SETREGS only uses the
>>>> user_regs_struct. So I went that route.
>>>
>>> Well, I don't know where user_regs_struct come from initially. But
>>> probably it is needed to allow to access the "artificial" things like
>>> fs_base. Or perhaps this struct mimics the layout in the coredump.
>>
>> Not sure - added Roland whose name was on many of the files :)
>>
>> I just noticed that ptrace ABI allows pt_regs access using the register
>> macros (PTRACE_PEEKUSR) and user_regs_struct access (PTRACE_GETREGS).
>>
>> But I think the latter is guaranteed to have a certain layout while the macros
>> for PEEKUSR can do post-processing fixup. �(Which could be done in the
>> bpf evaluator load_pointer() helper if needed.)
>>
>>>> I'd love for pt_regs to be fair game to cut down on the copying!
>>>
>>> Me too. I see no point in using user_regs_struct.
>>
>> I'll rev the change to use pt_regs and drop all the helper code. �If
>> no one says otherwise, that certainly seems ideal from a performance
>> perspective, and I see pt_regs exported to userland along with ptrace
>> abi register offset macros.
>
> On second thought, pt_regs is scary :)
>
> From looking at
> http://lxr.linux.no/linux+v3.2.1/arch/x86/include/asm/syscall.h#L97
> and ia32syscall enty code, it appears that for x86, at least, the
> pt_regs for compat processes will be 8 bytes wide per register on the
> stack. This means if a self-filtering 32-bit program runs on a 64-bit host in
> IA32_EMU, its filters will always index into pt_regs incorrectly.
>
> I'm not 100% that I am reading the code right, but it means that I can either
> keep using user_regs_struct or fork the code behavior based on compat. That
> would need to be arch dependent then which is pretty rough.
>
> Any thoughts?
>
> I'll do a v5 rev for Eric's comments soon, but I'm not quite sure
> about the pt_regs
> change yet. If the performance boost is worth the effort of having a
> per-arch fixup,
> I can go that route. Otherwise, I could look at some alternate approach
> for a faster-than-regview payload.
I recommend not giving access to the registers at all, but to instead provide
a fixed cross-platform ABI (a 32 bit version and one 64 bit version).
As everyone dealing with system call is mostly interested in the same things:
The syscall number and the arguments. You can add some other potentially useful
info like instruction pointer as well, but keep it limited to cross-platform
things with a clear meaning that make sense for system call filtering.
So I propose an interface like the following instead of a register interface:
/* Currently 6, but to be future proof, make it 8 */
#define MAX_SC_ARGS 8
struct syscall_bpf_data {
unsigned long syscall_nr;
unsigned long flags;
unsigned long instruction_pointer;
unsigned long arg[MAX_SC_ARGS];
unsigned long _reserved[5];
};
The flag argument can be used to e.g. tell if it is a compat 32 program
running on a 64 bit system.
This way the registers have to be interpreted only once by the kernel and all
filtering programs don't have to do that mapping themselves. It also avoids
doing unnecessary work fiddling/translating registers like the ptrace ABI does.
I missed if the original version was allowed to change the registers or not,
if it is then perhaps the BPF program should set a specific flag after changing
anything, to make it more explicit.
Greetings,
Indan
I don't believe it is possible to create a fixed cross-platform ABI
that will be compat and future safe. user_regs_struct (or even
pt_regs) should always match whatever each arch is doing -- even weird
personality-based changes. If we do a new ABI, not only does that
have to be exported to userland, but it means we're still copying and
munging the data around (which was why I was trying to see if pt_regs
was a easier way to get a speed boost).
If there's consensus, I'll change it (and use syscall_get_arguments),
but I don't believe it makes sense. (more below)
> As everyone dealing with system call is mostly interested in the same things:
> The syscall number and the arguments. You can add some other potentially useful
> info like instruction pointer as well, but keep it limited to cross-platform
> things with a clear meaning that make sense for system call filtering.
Well there are also clone/fork, sig_rt_return, etc related registers
too. I like not making the decision for each syscall filtering
consumer. We have an ABI so it seems like I'd be making work for the
kernel to manage yet another one for system call calling conventions.
> So I propose an interface like the following instead of a register interface:
>
> /* Currently 6, but to be future proof, make it 8 */
> #define MAX_SC_ARGS 8
>
> struct syscall_bpf_data {
> unsigned long syscall_nr;
> unsigned long flags;
> unsigned long instruction_pointer;
> unsigned long arg[MAX_SC_ARGS];
> unsigned long _reserved[5];
> };
>
> The flag argument can be used to e.g. tell if it is a compat 32 program
> running on a 64 bit system.
I certainly considered this, but I don't think this is a practical
idea. Firstly, CONFIG_COMPAT is meant to be compatibility mode. We
can't assume a program knows about it. Second, if we assume any new
program will be "smart" and check @flags, then the first few
instruction of _every_ (32-bit) seccomp filter program will be
checking compat mode - a serious waste :( I'm also not sure if
is_compat_task actually covers all random personality-based changes --
just 32-bit v 64-bit.
I _really_ wanted to make compat a flag and push that logic out of the
kernel, but I don't think it makes sense to burden all ABI consumers
with a "just in case" compat flag check. Also, what happens if a new,
weird architecture comes along where that flag doesn't make the same
sense? We can fix all the internal kernel stuff, but we'd end up with
an ABI change to boot :/ Using regviews, we stay consistent
regardless of whatever the new craziness is. I just wish there was a
way to make it speedier.
> This way the registers have to be interpreted only once by the kernel and all
> filtering programs don't have to do that mapping themselves. It also avoids
> doing unnecessary work fiddling/translating registers like the ptrace ABI does.
The kernel does only interpret them once (after entry to
__secure_computing). It gets the regview and has it populate a
user_regs_struct. All the register info is per-arch and matches
PTRACE_GETREGS, but not PTRACE_PEEKUSR. All the weird stuff is in
PEEKUSR to deal with the fact that compat pt_regs members are not
actually the same width as userspace would expect. If we populated an
ABI as you've proposed, we'd at least need to build that data set and
give it syscall_get_arguments() output.
I was hoping I could just hand over pt_regs and avoid any processing,
but it doesn't look promising. In theory, all the same bit-twiddling
compat_ptrace does could be done during load_pointer in the patch
series, but it seems wrong to go that route.
> I missed if the original version was allowed to change the registers or not,
> if it is then perhaps the BPF program should set a specific flag after changing
> anything, to make it more explicit.
Registers are const from the BPF perspective (just like with socket
filters). Adding support for tracehook interception later could
allow for supervisor guided register mutation.
Thanks!
will
On Mon, January 16, 2012 02:40, Will Drewry wrote:
> On Sat, Jan 14, 2012 at 9:40 PM, Indan Zupancic <in...@nul.nu> wrote:
>> On Sat, January 14, 2012 00:10, Will Drewry wrote:
>>> Any thoughts?
>>>
>>> I'll do a v5 rev for Eric's comments soon, but I'm not quite sure
>>> about the pt_regs
>>> change yet. If the performance boost is worth the effort of having a
>>> per-arch fixup,
>>> I can go that route. Otherwise, I could look at some alternate approach
>>> for a faster-than-regview payload.
>>
>> I recommend not giving access to the registers at all, but to instead provide
>> a fixed cross-platform ABI (a 32 bit version and one 64 bit version).
>
> I don't believe it is possible to create a fixed cross-platform ABI
> that will be compat and future safe.
The main problem is that BPF is inherently 32 bits while system call
arguments can be 64 bits.
You could change it so that all BPF programs are always 64 bits, which
would solve everything, except the when-to-sign-extend-and-when-not-to
problem. Luckily tgkill() is the only system call I'm aware of which
would have preferred sign extension, so never sign extending is fairly
safe. But this would require a new user space ABI as it's incompatible
with the current sock_filter.
You could make it work on "unsigned long" and solve all subtleties
that way. For 32 bits compat mode you execute the 32 bit version, if
you want to support it. This would require two seccomp_run_filter()
versions if compat mode is supported. Again, would change the BPF
filter ABI.
So considering that you seem to be stuck with running 32 bits BPF
filters anyway, you can as well make the input always 32 bits or
always 64 bits. 32 bits would lose information sometimes, so making
it always 64 bits without sign extension seems logical. This would
uglify the actual BPF filters immensely though, because then they
often have to check the upper argument half too, usually just to
make sure it's zero. They can't be sure that the kernel will ignore
the upper half for 'int' arguments.
So I suppose the question is: How do you expect people to use this on
64 bit systems?
> user_regs_struct (or even
> pt_regs) should always match whatever each arch is doing -- even weird
> personality-based changes.
I don't think weird personality-based changes are really relevant, no one
is going to write different versions of each filter depending on which
personality is being run. Perhaps weird personality changes should be
denied when a filter is installed, in case the filter forgets to do it.
So if the personality would change in a drastic way across an execve,
it should fail. That is something a filter can't check beforehand and
the only way to deal with it afterwards is by checking what the current
personality is for each system call.
All in all I think filters should be per personality, and if a process's
personality changes it is only allowed if there is also a filter installed
for the new personality too. Or just disallow personality changes, wasn't
that what your patch did anyway?
> If we do a new ABI, not only does that
> have to be exported to userland, but it means we're still copying and
> munging the data around (which was why I was trying to see if pt_regs
> was a easier way to get a speed boost).
The difference is that the register ABI uses the messy ptrace ABI which
is a bit strange and not cross-platform, while simply exporting system
call arguments is plain simple and what everyone tries to get out of the
pt_regs anyway.
But considering system call numbers are platform dependent anyway, it
doesn't really matter that much. I think an array of syscall arguments
would be a cleaner interface, but struct pt_regs would be acceptable.
> If there's consensus, I'll change it (and use syscall_get_arguments),
> but I don't believe it makes sense. (more below)
It's about system call filtering, I'd argue that giving anything else than
the arguments doesn't make sense. Yes, the registers are the current system
call ABI and reusing that is in a way simpler, but that ABI is about telling
the kernel what the arguments are, it's not the best way for the kernel to
tell userspace what the arguments are because for userspace it ends up in
some structure with its own ABI instead of in the actual registers.
>> As everyone dealing with system call is mostly interested in the same things:
>> The syscall number and the arguments. You can add some other potentially useful
>> info like instruction pointer as well, but keep it limited to cross-platform
>> things with a clear meaning that make sense for system call filtering.
>
> Well there are also clone/fork, sig_rt_return, etc related registers
> too.
What special clone/fork registers are you talking about?
I don't think anyone would want to ever filter sig_rt_return, you can as
well kill the process.
> I like not making the decision for each syscall filtering
> consumer. We have an ABI so it seems like I'd be making work for the
> kernel to manage yet another one for system call calling conventions.
I think it's pretty much about the arguments and not much else. Even
adding instruction pointer was a bit of a stretch, but it's something
I can imagine people using to make decisions. But as the BPF filters
are stateless they can't really use much else than the syscall number
and arguments anyway, the rest is usually too context dependent.
In order of exponentially less likelihood to filter on:
- syscall number
- syscall arguments
- Instruction pointer
- Stack pointer
- ...?!
Keep in mind that x86 only has a handful registers, 6 of them are used
for the arguments, one is the syscall number and return value, one is
the instruction pointer and there's the stack pointer. There just isn't
much room for much else.
Adding the instruction and stack pointers is quite a stretch already and
should cover pretty much any need. If there is any other information that
might be useful for filtering system calls I'd like to hear about it.
>> So I propose an interface like the following instead of a register interface:
>>
>> /* Currently 6, but to be future proof, make it 8 */
>> #define MAX_SC_ARGS 8
>>
>> struct syscall_bpf_data {
>> unsigned long syscall_nr;
>> unsigned long flags;
>> unsigned long instruction_pointer;
>> unsigned long arg[MAX_SC_ARGS];
>> unsigned long _reserved[5];
>> };
BTW, the width of the fields depends on how you want to resolve
the 64 bit issue. As BPF is always 32 bits, it doesn't make much
sense to use longs. And as offsets are used anyway, it probably
makes more sense to define those instead of a structure.
>>
>> The flag argument can be used to e.g. tell if it is a compat 32 program
>> running on a 64 bit system.
>
> I certainly considered this, but I don't think this is a practical
> idea. Firstly, CONFIG_COMPAT is meant to be compatibility mode. We
> can't assume a program knows about it. Second, if we assume any new
> program will be "smart" and check @flags, then the first few
> instruction of _every_ (32-bit) seccomp filter program will be
> checking compat mode - a serious waste :( I'm also not sure if
> is_compat_task actually covers all random personality-based changes --
> just 32-bit v 64-bit.
Yeah, bad idea. Forget about the flag thing.
> I _really_ wanted to make compat a flag and push that logic out of the
> kernel, but I don't think it makes sense to burden all ABI consumers
> with a "just in case" compat flag check. Also, what happens if a new,
> weird architecture comes along where that flag doesn't make the same
> sense? We can fix all the internal kernel stuff, but we'd end up with
> an ABI change to boot :/ Using regviews, we stay consistent
> regardless of whatever the new craziness is. I just wish there was a
> way to make it speedier.
Better to have filters per personality. That solves this whole issue,
independently of regviews or argument list ABI.
>> This way the registers have to be interpreted only once by the kernel and all
>> filtering programs don't have to do that mapping themselves. It also avoids
>> doing unnecessary work fiddling/translating registers like the ptrace ABI does.
>
> The kernel does only interpret them once (after entry to
> __secure_computing).
Not if data shuffling is needed for compat related stuff.
> It gets the regview and has it populate a
> user_regs_struct. All the register info is per-arch and matches
> PTRACE_GETREGS, but not PTRACE_PEEKUSR.
GETREGS seems to be a subset of PEEKUSR. That is, both start with
a struct pt_regs/user_regs_struct (seems to be the same thing?)
PEEKUSR only has extra access to debugging registers.
That is another problem of giving a register view: Which registers
are you going to give access to?
> All the weird stuff is in
> PEEKUSR to deal with the fact that compat pt_regs members are not
> actually the same width as userspace would expect.
>
> If we populated an ABI as you've proposed, we'd at least need to
> build that data set and give it syscall_get_arguments() output.
Yes, but that's all you have to do, nothing more.
The pt_regs a 64 bit kernel builds for a 32 bit compat process is
different than one from a 32 bit kernel, so you have to do some kind
of data shuffling anyway.
Worse, once you pick this ABI you're stuck with it and can't get rid
of compat horrors like you have now with ptrace(). Do you really want
to reuse an obscure ptrace ABI instead of creating a simpler new one?
> I was hoping I could just hand over pt_regs and avoid any processing,
> but it doesn't look promising. In theory, all the same bit-twiddling
> compat_ptrace does could be done during load_pointer in the patch
> series, but it seems wrong to go that route.
Your problem is worse because BPF programs are 32 bits but registers/args
can be 64 bit. Compared to that, running 32 bits on top of 64 bits seems
easy.
Do you propose that people not only know about 64 bitness, but also
about endianness when grabbing bits and pieces of 64 bit registers?
Because that seems like a fun source of bugs.
>> I missed if the original version was allowed to change the registers or not,
>> if it is then perhaps the BPF program should set a specific flag after changing
>> anything, to make it more explicit.
>
> Registers are const from the BPF perspective (just like with socket
> filters). Adding support for tracehook interception later could
> allow for supervisor guided register mutation.
If the ABI gives access to arguments instead of registers you don't have
to do anything tricky: No security checks, no need for fixing up register
values to their original value after the system call returns or any other
subtleties. BPF filters can just change the values without side effects.
I would prefer if it would work nicely with a ptrace supervisor, because
to me it seems that if something can't be resolved in the BPF filter, more
context and direct control is needed. The main downside of ptrace for
jailing is its overhead (and some quirks). If that can be taken away for
most system calls by using BPF then it would be useful for my use case.
Greetings,
Indan
Yes, thanks, I forgot about compat tasks again. But this is easy, just
we need regs_64_to_32().
Doesn't matter. I think Indan has a better suggestion.
Oleg.
Inefficient, but ok :)
> You could change it so that all BPF programs are always 64 bits, which
> would solve everything, except the when-to-sign-extend-and-when-not-to
> problem. Luckily tgkill() is the only system call I'm aware of which
> would have preferred sign extension, so never sign extending is fairly
> safe. But this would require a new user space ABI as it's incompatible
> with the current sock_filter.
Yeah, I don't think that'd be a good idea. I suspect that someday BPF
will grow a 64-bit set of instructions to deal with incoming socket
data better, but there's no rush.
> You could make it work on "unsigned long" and solve all subtleties
> that way. For 32 bits compat mode you execute the 32 bit version, if
> you want to support it. This would require two seccomp_run_filter()
> versions if compat mode is supported. Again, would change the BPF
> filter ABI.
Agreed - this seems like a bad path.
> So considering that you seem to be stuck with running 32 bits BPF
> filters anyway, you can as well make the input always 32 bits or
> always 64 bits. 32 bits would lose information sometimes, so making
> it always 64 bits without sign extension seems logical.
This is not fully formed logic. BPF can operate on values that exceed
its "bus" width. Just because it can't do a load/jump on a 64-bit
value doesn't mean you can't implement a jump based on a 64-bit value.
You just split it up. Load the high order 32-bits, jump on failure,
fall through and load the next 32-bits and do the rest of the
comparison. Extending Eric's python translator wouldn't be
particularly painful and then it would be transparent to an end user.
> This would
> uglify the actual BPF filters immensely though, because then they
> often have to check the upper argument half too, usually just to
> make sure it's zero. They can't be sure that the kernel will ignore
> the upper half for 'int' arguments.
Of course not! This will uglify the filters until someday when BPF
grows 64-bit support (or never), but it's not that bad. The BPF
doesn't need to be pretty, just effective. And it could be made even
easier with JIT support enabled since it could provide better native
code behaviors.
> So I suppose the question is: How do you expect people to use this on
> 64 bit systems?
As mentioned above. The whole point of using BPF and user_regs_struct
is to implement _less_ kernel magic instead of more.
>> user_regs_struct (or even
>> pt_regs) should always match whatever each arch is doing -- even weird
>> personality-based changes.
>
> I don't think weird personality-based changes are really relevant, no one
> is going to write different versions of each filter depending on which
> personality is being run. Perhaps weird personality changes should be
> denied when a filter is installed, in case the filter forgets to do it.
It does this in the patch today. Personalities can affect system call
numbers and argument ordering so it is relevant. It'd also be a
viable way to escape system call filters if they weren't locked.
> So if the personality would change in a drastic way across an execve,
> it should fail. That is something a filter can't check beforehand and
> the only way to deal with it afterwards is by checking what the current
> personality is for each system call.
>
> All in all I think filters should be per personality, and if a process's
> personality changes it is only allowed if there is also a filter installed
> for the new personality too. Or just disallow personality changes, wasn't
> that what your patch did anyway?
Yup :) You can't predict _all_ personality-ish changes (at least with x86).
>> If we do a new ABI, not only does that
>> have to be exported to userland, but it means we're still copying and
>> munging the data around (which was why I was trying to see if pt_regs
>> was a easier way to get a speed boost).
>
> The difference is that the register ABI uses the messy ptrace ABI which
> is a bit strange and not cross-platform, while simply exporting system
> call arguments is plain simple and what everyone tries to get out of the
> pt_regs anyway.
user_regs_struct != pt_regs. user_regs_struct is acquired using
regviews which is already provided by each arch for use by coredump
generation and PTRACE_[GS]ETREGS. There is no messy ptrace ABI.
Also, the whole point of the patch series was to create filters that
were not cross-platform. I don't believe that is the kernel's job.
system calls are inherently arch specific so why go to all the effort
to hide that?
> But considering system call numbers are platform dependent anyway, it
> doesn't really matter that much. I think an array of syscall arguments
> would be a cleaner interface, but struct pt_regs would be acceptable.
As I said, user_regs_struct is the portable pt_regs, so I don't see
why it's a problem. But, using syscall_get_arguments is doable too if
that's the route this goes.
>> If there's consensus, I'll change it (and use syscall_get_arguments),
>> but I don't believe it makes sense. (more below)
>
> It's about system call filtering, I'd argue that giving anything else than
> the arguments doesn't make sense. Yes, the registers are the current system
> call ABI and reusing that is in a way simpler, but that ABI is about telling
> the kernel what the arguments are, it's not the best way for the kernel to
> tell userspace what the arguments are because for userspace it ends up in
> some structure with its own ABI instead of in the actual registers.
I don't see this disconnect. The ABI is as expected by userspace.
Giving an array of arguments and a system call number will work fine,
but it is not a known ABI. We can create a new one, but I don't
believe this argument justifies it. It's not what the kernel is
telling user space, it's what is safe to evaluate in the kernel using
what userspace knows without adding another new interface. If a new
interface is what it takes to get this merged, then I'll clearly do
it, but I'm still not sold that it is actually better.
>>> As everyone dealing with system call is mostly interested in the same things:
>>> The syscall number and the arguments. You can add some other potentially useful
>>> info like instruction pointer as well, but keep it limited to cross-platform
>>> things with a clear meaning that make sense for system call filtering.
>>
>> Well there are also clone/fork, sig_rt_return, etc related registers
>> too.
>
> What special clone/fork registers are you talking about?
On x86, si is used to indicate the tls area:
http://lxr.linux.no/linux+v3.2.1/arch/x86/kernel/process_32.c#L238
(and r8 on 64-bit). Also segment registers, etc.
> I don't think anyone would want to ever filter sig_rt_return, you can as
> well kill the process.
Why make that decision for them?
>> I like not making the decision for each syscall filtering
>> consumer. �We have an ABI so it seems like I'd be making work for the
>> kernel to manage yet another one for system call calling conventions.
>
> I think it's pretty much about the arguments and not much else. Even
> adding instruction pointer was a bit of a stretch, but it's something
> I can imagine people using to make decisions. But as the BPF filters
> are stateless they can't really use much else than the syscall number
> and arguments anyway, the rest is usually too context dependent.
>
> In order of exponentially less likelihood to filter on:
> - syscall number
> - syscall arguments
> - Instruction pointer
> - Stack pointer
> - ...?!
>
> Keep in mind that x86 only has a handful registers, 6 of them are used
> for the arguments, one is the syscall number and return value, one is
> the instruction pointer and there's the stack pointer. There just isn't
> much room for much else.
>
> Adding the instruction and stack pointers is quite a stretch already and
> should cover pretty much any need. If there is any other information that
> might be useful for filtering system calls I'd like to hear about it.
At this point, why create a new ABI? Just use the existing fully
support register views expressed via user_regs_struct.
That said, I can imagine filtering on other registers as being useful
for tentative research. Think of the past work where control flow
integrity was done by XORing the system call number with a run-time
selected value. Instead of doing that, you could populate a
non-argument register with the xor of the syscall number and the
secret (picked and then added to the BPF program before install).
I'm not saying this is a good idea, but it seems silly to exclude it
when there doesn't seem to be any specific gain and only added
kernel-side complexity. It may also be useful to know what other
saved registers (segment, etc) depending on what sort of sandboxing is
being done. The PC/IP is fun for that one since you could limit all
syscalls to only come from the vdso or vsyscall locations.
>>> So I propose an interface like the following instead of a register interface:
>>>
>>> /* Currently 6, but to be future proof, make it 8 */
>>> #define MAX_SC_ARGS � � 8
>>>
>>> struct syscall_bpf_data {
>>> � � � �unsigned long syscall_nr;
>>> � � � �unsigned long flags;
>>> � � � �unsigned long instruction_pointer;
>>> � � � �unsigned long arg[MAX_SC_ARGS];
>>> � � � �unsigned long _reserved[5];
>>> };
>
> BTW, the width of the fields depends on how you want to resolve
> the 64 bit issue. As BPF is always 32 bits, it doesn't make much
> sense to use longs. And as offsets are used anyway, it probably
> makes more sense to define those instead of a structure.
Yup. I'm still not sold on needing a standalone ABI for this when it
is some combination of syscall_get_arguments and KSTK_EIP, since
user_regs_struct already handles the right type widths, etc. In fact,
it gets a bit more challenging.
If you look at syscall_get_arguments for x86, it always uses unsigned
long even when it is a TS_COMPAT task:
lxr.linux.no/linux+v3.2.1/arch/x86/include/asm/syscall.h#L97
That means that the BPF exposed types would either always need to be
unsigned long, irrespective of is_compat_task, or seccomp filter would
need additional per-arch fixups (which is what regviews already do :/
).
>>>
>>> The flag argument can be used to e.g. tell if it is a compat 32 program
>>> running on a 64 bit system.
>>
>> I certainly considered this, but I don't think this is a practical
>> idea. �Firstly, CONFIG_COMPAT is meant to be compatibility mode. �We
>> can't assume a program knows about it. �Second, if we assume any new
>> program will be "smart" and check @flags, then the first few
>> instruction of _every_ (32-bit) seccomp filter program will be
>> checking compat mode - a serious waste :( �I'm also not sure if
>> is_compat_task actually covers all random personality-based changes --
>> just 32-bit v 64-bit.
>
> Yeah, bad idea. Forget about the flag thing.
It seems to elegant too :(
>> I _really_ wanted to make compat a flag and push that logic out of the
>> kernel, but I don't think it makes sense to burden all ABI consumers
>> with a "just in case" compat flag check. �Also, what happens if a new,
>> weird architecture comes along where that flag doesn't make the same
>> sense? �We can fix all the internal kernel stuff, but we'd end up with
>> an ABI change to boot :/ �Using regviews, we stay consistent
>> regardless of whatever the new craziness is. �I just wish there was a
>> way to make it speedier.
>
> Better to have filters per personality. That solves this whole issue,
> independently of regviews or argument list ABI.
Agreed -- except that, as I mentioned above, there are still
significant complexities kernel-side if anything other than regviews
are used.
>>> This way the registers have to be interpreted only once by the kernel and all
>>> filtering programs don't have to do that mapping themselves. It also avoids
>>> doing unnecessary work fiddling/translating registers like the ptrace ABI does.
>>
>> The kernel does only interpret them once (after entry to
>> __secure_computing).
>
> Not if data shuffling is needed for compat related stuff.
I agree! user_regs_struct get rid of the data shuffling. pt_regs and
syscall_get_arguments all seem to induced data shuffling for compat
junk. I just wish pt_regs was compat-width appropriate, but it makes
sense that a 64-bit kernel with a 32-bit program would use 64-bit
registers on its side. Just frustrating.
>> It gets the regview and has it populate a
>> user_regs_struct. �All the register info is per-arch and matches
>> PTRACE_GETREGS, but not PTRACE_PEEKUSR.
>
> GETREGS seems to be a subset of PEEKUSR. That is, both start with
> a struct pt_regs/user_regs_struct (seems to be the same thing?)
Not quite. on x86-32, pt_regs and user_regs_struct are identical.
Power PC as well, I think. They diverge on pretty much every other
platform. Also, x86 compat has some trickiness. pt_regs is 64-bit on
x86-64 even with compat processes. Instead what happens is the naming
is kept if __KERNEL__ such that there aren't different struct member
names in all the syscall.h and ptrace code. The
IA32_EMULATION/TS_COMPAT stuff can then just use the reordered member
names without even more #ifdef madness.
user_regs_struct will use the correct width according to the process
personality. On all arches with is_compat_task support, this matches
-- except x86. With x86, you can force a 32-bit syscall entry from a
64-bit process resulting in a temporary setting of TS_COMPAT but with
a personality that is still 64-bit. This is an edge case and one I
think forcing compat and personality to not-change addresses.
> PEEKUSR only has extra access to debugging registers.
GETREGS uses a regview:
http://lxr.linux.no/linux+v3.2.1/arch/x86/kernel/ptrace.c#L1153
PEEKUSR uses getreg or getreg32 directly (on x86). compat_arch_ptrace
on x86 will then grab a specified register based on the 32-bit offsets
out of a 64-bit pt_regs and can return any register offset, like
ORIG_EAX:
lxr.linux.no/linux+v3.2.1/arch/x86/kernel/ptrace.c#L1026
It's this magic fixup that allows ptrace to just use pt_regs for
PEEKUSR while GETREGS is forced to do the full register copy.
> That is another problem of giving a register view: Which registers
> are you going to give access to?
Always the general set. This is the set denoted by core_note_type
NT_PRSTATUS on all architectures as far as I can tell.
>> All the weird stuff is in
>> PEEKUSR to deal with the fact that compat pt_regs members are not
>> actually the same width as userspace would expect.
>>
>> If we populated an ABI as you've proposed, we'd at least need to
>> build that data set and give it syscall_get_arguments() output.
>
> Yes, but that's all you have to do, nothing more.
I do even less now! :)
> The pt_regs a 64 bit kernel builds for a 32 bit compat process is
> different than one from a 32 bit kernel, so you have to do some kind
> of data shuffling anyway.
Yes - that's why I use user_regs_struct.
> Worse, once you pick this ABI you're stuck with it and can't get rid
> of compat horrors like you have now with ptrace(). Do you really want
> to reuse an obscure ptrace ABI instead of creating a simpler new one?
Exactly why I'm using user_regs_struct. I think we've been having
some cross-talk, but I'm not sure. The only edge case I can find with
user_regs_struct across all platforms is the nasty x86 32-bit entry
from 64-bit personality. Perhaps someday we can just nuke that path
:) But even so, it is tightly constrained by saving the personality
and compat flag in the filter program metadata and checking it at each
syscall.
>> I was hoping I could just hand over pt_regs and avoid any processing,
>> but it doesn't look promising. �In theory, all the same bit-twiddling
>> compat_ptrace does could be done during load_pointer in the patch
>> series, but it seems wrong to go that route.
>
> Your problem is worse because BPF programs are 32 bits but registers/args
> can be 64 bit. Compared to that, running 32 bits on top of 64 bits seems
> easy.
>
> Do you propose that people not only know about 64 bitness, but also
> about endianness when grabbing bits and pieces of 64 bit registers?
> Because that seems like a fun source of bugs.
Endianness calling convention specific. For arches that allow
endianness changes, that should be personality based. I believe that
"people" don't need to know anything unless they are crafting BPF
filters by hand, but I do believe that the userland software they rely
on should understand the current endianness and system call
conventions. glibc already has to know this stuff, and so does any
other piece of userland code directly interacting with the kernel, so
I don't believe it is an hardship on userland. It certainly isn't
shiny and isn't naturally intuitive, but those don't seem like the
only guiding requirements. Making it cross-arch and future-friendly
using what user-space is already aware of seems like it will result in
a robust ABI less afflicted by bit rot or the addition of a crazy new
128-bit architecture :) But who knows.
>>> I missed if the original version was allowed to change the registers or not,
>>> if it is then perhaps the BPF program should set a specific flag after changing
>>> anything, to make it more explicit.
>>
>> Registers are const from the BPF perspective (just like with socket
>> filters). � Adding support for tracehook interception later could
>> allow for supervisor guided register mutation.
>
> If the ABI gives access to arguments instead of registers you don't have
> to do anything tricky: No security checks, no need for fixing up register
> values to their original value after the system call returns or any other
> subtleties. BPF filters can just change the values without side effects.
BPF programs should never change any filters. BPF does not have the
capability to modify the data it is evaluating. Doing that would
require a BPF change and alter its very nature, imo.
While arguments seem tidy, we still end up with the nasty compat pain
and it is only worse kernel-side since there'd be no arch-independent
way to get the correct width system call arguments. I'd need to guess
they were 32-bit and downgrade them if compat. That or add a new arch
callout. Very fiddly :/
> I would prefer if it would work nicely with a ptrace supervisor, because
> to me it seems that if something can't be resolved in the BPF filter, more
> context and direct control is needed. The main downside of ptrace for
> jailing is its overhead (and some quirks). If that can be taken away for
> most system calls by using BPF then it would be useful for my use case.
I could not agree more. I have a patch already in existence that adds
a call to tracehook_syscall_entry on failure under certain conditions,
but I didn't want to bog down discussion of the core feature with that
discussion too. I think supporting a ptrace supervisor would allow
for better debugging and sandbox development. (Then I think most of
the logic could move directly to BPF. E.g., only allow pointer
arguments for open() to live in this known read-only memory, etc.)
Cheers!
will
Yup - we could make the assumption that is_compat_task is always
32-bit and the pt_regs is always 64-bit, then copy_and_truncate with
regs_64_to_32. Seems kinda wonky though :/
> Doesn't matter. I think Indan has a better suggestion.
I disagree, but perhaps I'm not fully understanding!
Thanks!
will
Nice catch:
lxr.linux.no/linux+v3.2.1/arch/x86/kernel/vsyscall_64.c#L180
I'd missed it.
>�ソスIf you want filtering to work, there
> aren't any real syscall registers to inspect, but they could be
> synthesized.
Hrm, I wonder if making sure orig_eax is populated with the
vsyscall_nr would be enough. Unless I'm misreading, args 0 and 1 are
correct, so there may be other noise, but performing a call to
__secure_computing() (either in the case or with a pre-validate
syscall nr: 0-2) should send the do_exit. Does that sound reasonable?
I'll try to do the right thing in my next patch set.
> Preventing a malicious task from figuring out approximately what time
> it is is basically impossible because of the way that vvars work. �ソスI
> don't know how to change that efficiently.
There are other ways to guess the time too, so I don't think it's that
bad. For those that are really worried, they could disable or
otherwise attempt to limit vsyscall access from their sandbox.
thanks!
I hope you're right.
If BPF is always 32 bits and people have to deal with the 64 bit pain
anyway, you can as well have one fixed ABI that is always the same and
cross-platform by just making the arguments always 64 bit, with lower
and upper halves in a fixed order to avoid endianness problems. This
would be compat and future safe.
>> So considering that you seem to be stuck with running 32 bits BPF
>> filters anyway, you can as well make the input always 32 bits or
>> always 64 bits. 32 bits would lose information sometimes, so making
>> it always 64 bits without sign extension seems logical.
>
> This is not fully formed logic. BPF can operate on values that exceed
> its "bus" width. Just because it can't do a load/jump on a 64-bit
> value doesn't mean you can't implement a jump based on a 64-bit value.
> You just split it up. Load the high order 32-bits, jump on failure,
> fall through and load the next 32-bits and do the rest of the
Yes, hence my proposal to just bite the bullet and provide a fixed,
cross-platform 64 bit argument interface.
> comparison. Extending Eric's python translator wouldn't be
> particularly painful and then it would be transparent to an end user.
Your end user uses your ABI directly, Eric's python translator is only
one of them.
>> This would
>> uglify the actual BPF filters immensely though, because then they
>> often have to check the upper argument half too, usually just to
>> make sure it's zero. They can't be sure that the kernel will ignore
>> the upper half for 'int' arguments.
>
> Of course not! This will uglify the filters until someday when BPF
> grows 64-bit support (or never), but it's not that bad. The BPF
> doesn't need to be pretty, just effective. And it could be made even
> easier with JIT support enabled since it could provide better native
> code behaviors.
What JIT? If there is one I doubt it's smart enough to consolidate two
32 bit operations into one 64 bit operation.
>> So I suppose the question is: How do you expect people to use this on
>> 64 bit systems?
>
> As mentioned above. The whole point of using BPF and user_regs_struct
> is to implement _less_ kernel magic instead of more.
At the cost of making it cross-platform and harder to use. I think it is
a bit sad that the code still ends up being so platform dependent while
it is running in a virtual machine.
And you still have to fix up the compat case.
[...]
>>> If we do a new ABI, not only does that
>>> have to be exported to userland, but it means we're still copying and
>>> munging the data around (which was why I was trying to see if pt_regs
>>> was a easier way to get a speed boost).
>>
>> The difference is that the register ABI uses the messy ptrace ABI which
>> is a bit strange and not cross-platform, while simply exporting system
>> call arguments is plain simple and what everyone tries to get out of the
>> pt_regs anyway.
>
> user_regs_struct != pt_regs. user_regs_struct is acquired using
> regviews which is already provided by each arch for use by coredump
> generation and PTRACE_[GS]ETREGS. There is no messy ptrace ABI.
How is exporting registers via a structure not messy? And if PEEKUSR
uses a different ABI then ptrace's ABI is very messy. And it gets messy
whenever you cross a 32/64 bit boundary.
> Also, the whole point of the patch series was to create filters that
> were not cross-platform. I don't believe that is the kernel's job.
It's the kernel's job to provide an abstracted view of the hardware so
user space has a consistent view. Not trying to make it cross-platform
is just slacking.
> system calls are inherently arch specific so why go to all the effort
> to hide that?
Because, although the numbers are certainly arch specific, the system calls
themselves including the argument ordering are surprisingly consistent.
The numbers are handled by the SYS_* defines, so when porting to a different
arch people just have to check if the arguments they use still match and
that's it. If you provide registers they have to put more effort into
porting the code.
>> But considering system call numbers are platform dependent anyway, it
>> doesn't really matter that much. I think an array of syscall arguments
>> would be a cleaner interface, but struct pt_regs would be acceptable.
>
> As I said, user_regs_struct is the portable pt_regs, so I don't see
> why it's a problem. But, using syscall_get_arguments is doable too if
> that's the route this goes.
It's not portable because it is different for every arch.
>>> If there's consensus, I'll change it (and use syscall_get_arguments),
>>> but I don't believe it makes sense. (more below)
>>
>> It's about system call filtering, I'd argue that giving anything else than
>> the arguments doesn't make sense. Yes, the registers are the current system
>> call ABI and reusing that is in a way simpler, but that ABI is about telling
>> the kernel what the arguments are, it's not the best way for the kernel to
>> tell userspace what the arguments are because for userspace it ends up in
>> some structure with its own ABI instead of in the actual registers.
>
> I don't see this disconnect. The ABI is as expected by userspace.
> Giving an array of arguments and a system call number will work fine,
> but it is not a known ABI. We can create a new one, but I don't
> believe this argument justifies it. It's not what the kernel is
> telling user space, it's what is safe to evaluate in the kernel using
> what userspace knows without adding another new interface. If a new
> interface is what it takes to get this merged, then I'll clearly do
> it, but I'm still not sold that it is actually better.
You are adding a new interface anyway with this feature. Using user_regs_struct
is not something expected by userspace, except if it are hardcore ptrace users.
And they will get the offsets wrong if they are on 64 bits because those are
different than for ptrace. The ptrace ABI uses longs, BPF is fixed to 32 bits,
it's just not a good fit.
Put another way, why isn't user_regs_struct passed on to each system call
implementation in the kernel instead of the arguments? It's exactly the
same reason as why passing arguments is better for BPF too.
>>>> As everyone dealing with system call is mostly interested in the same things:
>>>> The syscall number and the arguments. You can add some other potentially useful
>>>> info like instruction pointer as well, but keep it limited to cross-platform
>>>> things with a clear meaning that make sense for system call filtering.
>>>
>>> Well there are also clone/fork, sig_rt_return, etc related registers
>>> too.
>>
>> What special clone/fork registers are you talking about?
>
> On x86, si is used to indicate the tls area:
> http://lxr.linux.no/linux+v3.2.1/arch/x86/kernel/process_32.c#L238
> (and r8 on 64-bit). Also segment registers, etc.
si is just the 4th (5th for x86_64) system call argument, it's nothing special.
System calls never use segment registers directly, do they? It's not part
of the system call ABI, so why would you want to use them in filtering?
>> I don't think anyone would want to ever filter sig_rt_return, you can as
>> well kill the process.
>
> Why make that decision for them?
I don't, I'm just saying it doesn't make sense to filter it. Based on what
would anyone ever want to filter it? It's just a kernel helper thing to
implement signal handlers.
But now you mention it, it won't be bad to enforce this in the kernel,
otherwise everyone has to add code to allow it. Same for exit(_group).
Because if those are denied, there's nothing else to run instead.
>>> I like not making the decision for each syscall filtering
>>> consumer. �We have an ABI so it seems like I'd be making work for the
>>> kernel to manage yet another one for system call calling conventions.
>>
>> I think it's pretty much about the arguments and not much else. Even
>> adding instruction pointer was a bit of a stretch, but it's something
>> I can imagine people using to make decisions. But as the BPF filters
>> are stateless they can't really use much else than the syscall number
>> and arguments anyway, the rest is usually too context dependent.
>>
>> In order of exponentially less likelihood to filter on:
>> - syscall number
>> - syscall arguments
>> - Instruction pointer
>> - Stack pointer
>> - ...?!
>>
>> Keep in mind that x86 only has a handful registers, 6 of them are used
>> for the arguments, one is the syscall number and return value, one is
>> the instruction pointer and there's the stack pointer. There just isn't
>> much room for much else.
>>
>> Adding the instruction and stack pointers is quite a stretch already and
>> should cover pretty much any need. If there is any other information that
>> might be useful for filtering system calls I'd like to hear about it.
>
> At this point, why create a new ABI? Just use the existing fully
> support register views expressed via user_regs_struct.
It's the difference between 6 args + a couple extras versus 17 registers
for a register starved arch like x86.
But yeah, better to not provide the instruction or stack pointers indeed.
At least the instruction pointer gives some system call related information
(from where it is called).
> That said, I can imagine filtering on other registers as being useful
> for tentative research.
They can use ptrace for that.
> Think of the past work where control flow
> integrity was done by XORing the system call number with a run-time
> selected value. Instead of doing that, you could populate a
> non-argument register with the xor of the syscall number and the
> secret (picked and then added to the BPF program before install).
What non-argument register would you like to use on x86? I think all
are used up already. All you got left is the segment registers, and
using those seems a bad idea. It also seems a bad idea to promote
non-portable BPF filtering programs.
If you support modifying arguments and syscall nr then people can keep
doing the XORing trick with BPF. Another advantage of allowing that is
that unsafe old system calls can be replaced with secure ones on the
fly transparently.
Really, disallowing modifications is much more limiting than not providing
all registers. But allowing modifications is a lot harder to get right
with a register interface.
> I'm not saying this is a good idea, but it seems silly to exclude it
> when there doesn't seem to be any specific gain and only added
> kernel-side complexity. It may also be useful to know what other
> saved registers (segment, etc) depending on what sort of sandboxing is
> being done. The PC/IP is fun for that one since you could limit all
> syscalls to only come from the vdso or vsyscall locations.
Problem is that that is less useful than it seems because malicious code
can always just jump to a syscall entry instruction. Randomization helps
a bit, but it gives no guarantees. Better to store an XORed secret in the
syscall nr and arguments, that gives up to 224 bits of security.
>> BTW, the width of the fields depends on how you want to resolve
>> the 64 bit issue. As BPF is always 32 bits, it doesn't make much
>> sense to use longs. And as offsets are used anyway, it probably
>> makes more sense to define those instead of a structure.
>
> Yup. I'm still not sold on needing a standalone ABI for this when it
> is some combination of syscall_get_arguments and KSTK_EIP, since
> user_regs_struct already handles the right type widths, etc. In fact,
> it gets a bit more challenging.
I would go for system call number + arguments only, and forget about the
EIP and stack, except if people really want it. But if you do add it then
it's barely any less limiting than a register view.
> If you look at syscall_get_arguments for x86, it always uses unsigned
> long even when it is a TS_COMPAT task:
> lxr.linux.no/linux+v3.2.1/arch/x86/include/asm/syscall.h#L97
> That means that the BPF exposed types would either always need to be
> unsigned long, irrespective of is_compat_task, or seccomp filter would
> need additional per-arch fixups (which is what regviews already do :/
> ).
If compat tasks are involved you are screwed anyway and have to fiddle
with data, it's unavoidable.
Arguments exposed to BPF should always be 64 bits even on 32 bit archs,
that solves all compat and portability problems.
I really don't see the problem of copying 6 arguments to a fixed place.
If that is tricky then you're either trying to use the wrong function
or doing it at the wrong place in the kernel. I'd expect that passing on
the arguments is highly optimised in the kernel, all system calls have
easy access to them, why would it be hard for the BPF code to get it?
If you use syscall_get_arguments you have to call it once for each arg
instead of calling it once and trying to fix up the 32/64 bit and
endianness afterwards.
So call it once and store the value in a long. Then copy the low half
to the right place and then the upper half when on 64 bits. It may not
look too pretty, but the compiler should be able to optimise almost all
overhead away and end up with 6 (or 12) int copies. Something like this:
struct bpf_data {
uint32 syscall_nr;
uint32 arg_low[MAX_SC_ARGS];
uint32 arg_high[MAX_SC_ARGS];
};
void fill_bpf_data(struct task_struct *t, struct pt_regs *r, struct bpf_data *d)
{
int i;
unsigned long arg;
d->syscall_nr = syscall_get_nr(t, r);
for (i = 0; i < MAX_SC_ARGS; ++i){
syscall_get_arguments(t, r, i, 1, &arg);
d->arg_low[i] = arg;
d->arg_high[i] = arg >> 32;
}
}
>
> Agreed -- except that, as I mentioned above, there are still
> significant complexities kernel-side if anything other than regviews
> are used.
I'm missing what those complexities are.
>
>>>> This way the registers have to be interpreted only once by the kernel and all
>>>> filtering programs don't have to do that mapping themselves. It also avoids
>>>> doing unnecessary work fiddling/translating registers like the ptrace ABI does.
>>>
>>> The kernel does only interpret them once (after entry to
>>> __secure_computing).
>>
>> Not if data shuffling is needed for compat related stuff.
>
> I agree! user_regs_struct get rid of the data shuffling. pt_regs and
> syscall_get_arguments all seem to induced data shuffling for compat
> junk. I just wish pt_regs was compat-width appropriate, but it makes
> sense that a 64-bit kernel with a 32-bit program would use 64-bit
> registers on its side. Just frustrating.
Are user_regs_struct entries 32-bit for 32-bit tasks or is it 64-bit if
the kernel is 64-bit? If they're 64-bit then you didn't get rid of the
data shuffling.
>>> It gets the regview and has it populate a
>>> user_regs_struct. �All the register info is per-arch and matches
>>> PTRACE_GETREGS, but not PTRACE_PEEKUSR.
>>
>> GETREGS seems to be a subset of PEEKUSR. That is, both start with
>> a struct pt_regs/user_regs_struct (seems to be the same thing?)
>
>
> Not quite. on x86-32, pt_regs and user_regs_struct are identical.
> Power PC as well, I think. They diverge on pretty much every other
> platform. Also, x86 compat has some trickiness. pt_regs is 64-bit on
> x86-64 even with compat processes. Instead what happens is the naming
> is kept if __KERNEL__ such that there aren't different struct member
> names in all the syscall.h and ptrace code. The
> IA32_EMULATION/TS_COMPAT stuff can then just use the reordered member
> names without even more #ifdef madness.
It was a surprise to me to find out that the pt_regs a 64-bit ptrace user
gets for a 32 bit tracee differs from the pt_regs when both are 32 bits.
> user_regs_struct will use the correct width according to the process
> personality. On all arches with is_compat_task support, this matches
> -- except x86. With x86, you can force a 32-bit syscall entry from a
> 64-bit process resulting in a temporary setting of TS_COMPAT but with
> a personality that is still 64-bit. This is an edge case and one I
> think forcing compat and personality to not-change addresses.
How's that possible? Setting CS to 0x23? Can userspace do that?
>> PEEKUSR only has extra access to debugging registers.
>
> GETREGS uses a regview:
> http://lxr.linux.no/linux+v3.2.1/arch/x86/kernel/ptrace.c#L1153
> PEEKUSR uses getreg or getreg32 directly (on x86). compat_arch_ptrace
> on x86 will then grab a specified register based on the 32-bit offsets
> out of a 64-bit pt_regs and can return any register offset, like
> ORIG_EAX:
> lxr.linux.no/linux+v3.2.1/arch/x86/kernel/ptrace.c#L1026
>
> It's this magic fixup that allows ptrace to just use pt_regs for
> PEEKUSR while GETREGS is forced to do the full register copy.
>
>> That is another problem of giving a register view: Which registers
>> are you going to give access to?
>
> Always the general set. This is the set denoted by core_note_type
> NT_PRSTATUS on all architectures as far as I can tell.
>
>>> All the weird stuff is in
>>> PEEKUSR to deal with the fact that compat pt_regs members are not
>>> actually the same width as userspace would expect.
>>>
>>> If we populated an ABI as you've proposed, we'd at least need to
>>> build that data set and give it syscall_get_arguments() output.
>>
>> Yes, but that's all you have to do, nothing more.
>
> I do even less now! :)
You have to do more for the compat case.
>> The pt_regs a 64 bit kernel builds for a 32 bit compat process is
>> different than one from a 32 bit kernel, so you have to do some kind
>> of data shuffling anyway.
>
> Yes - that's why I use user_regs_struct.
But there are different versions of user_regs_struct, depending on the
situation. This implies that the BPF filters have to be different too,
while they could be exactly the same (except for the syscall nr).
>> Worse, once you pick this ABI you're stuck with it and can't get rid
>> of compat horrors like you have now with ptrace(). Do you really want
>> to reuse an obscure ptrace ABI instead of creating a simpler new one?
>
> Exactly why I'm using user_regs_struct. I think we've been having
> some cross-talk, but I'm not sure. The only edge case I can find with
> user_regs_struct across all platforms is the nasty x86 32-bit entry
> from 64-bit personality. Perhaps someday we can just nuke that path
> :) But even so, it is tightly constrained by saving the personality
> and compat flag in the filter program metadata and checking it at each
> syscall.
I think it's a good idea to nuke that path, it seems like a security hole
waiting to happen.
>
>>> I was hoping I could just hand over pt_regs and avoid any processing,
>>> but it doesn't look promising. �In theory, all the same bit-twiddling
>>> compat_ptrace does could be done during load_pointer in the patch
>>> series, but it seems wrong to go that route.
>>
>> Your problem is worse because BPF programs are 32 bits but registers/args
>> can be 64 bit. Compared to that, running 32 bits on top of 64 bits seems
>> easy.
>>
>> Do you propose that people not only know about 64 bitness, but also
>> about endianness when grabbing bits and pieces of 64 bit registers?
>> Because that seems like a fun source of bugs.
>
> Endianness calling convention specific. For arches that allow
> endianness changes, that should be personality based. I believe that
> "people" don't need to know anything unless they are crafting BPF
> filters by hand, but I do believe that the userland software they rely
> on should understand the current endianness and system call
> conventions. glibc already has to know this stuff, and so does any
> other piece of userland code directly interacting with the kernel, so
> I don't believe it is an hardship on userland. It certainly isn't
> shiny and isn't naturally intuitive, but those don't seem like the
> only guiding requirements. Making it cross-arch and future-friendly
> using what user-space is already aware of seems like it will result in
> a robust ABI less afflicted by bit rot or the addition of a crazy new
> 128-bit architecture :) But who knows.
If your ABI is too hard to use directly, it won't be used at all.
Any excuse that people won't use this ABI directly is a sign that
it is not good enough.
And the more complicated you make it, the less likely it is that
anyone will use this.
>
>>>> I missed if the original version was allowed to change the registers or not,
>>>> if it is then perhaps the BPF program should set a specific flag after changing
>>>> anything, to make it more explicit.
>>>
>>> Registers are const from the BPF perspective (just like with socket
>>> filters). � Adding support for tracehook interception later could
>>> allow for supervisor guided register mutation.
>>
>> If the ABI gives access to arguments instead of registers you don't have
>> to do anything tricky: No security checks, no need for fixing up register
>> values to their original value after the system call returns or any other
>> subtleties. BPF filters can just change the values without side effects.
>
> BPF programs should never change any filters. BPF does not have the
> capability to modify the data it is evaluating. Doing that would
> require a BPF change and alter its very nature, imo.
It could if you make the data part of the scratch memory. If you put the
data at the top, just after BPF_MEMWORDS, then it's all compatible with
the read-only version. Except the sk_chk_filter() code. But if you ever
want to consolidate with the networking version, then you already need
new non-byteswapping instructions. You can as well add a special modify
instruction too then. Making it very explicit seems better anyway.
Using BPF for system call filtering changes its very nature already.
I must say that until your patch came up, I've never heard of BPF filters
before. I think I'm going to use it in our ptrace jailer for network
filtering, if it's possible to get the peer address for normal TCP/UDP
sockets. Documentation is quite vague.
> While arguments seem tidy, we still end up with the nasty compat pain
> and it is only worse kernel-side since there'd be no arch-independent
> way to get the correct width system call arguments. I'd need to guess
> they were 32-bit and downgrade them if compat. That or add a new arch
> callout. Very fiddly :/
See code above. It seems fairly tidy to me.
You could also do the BPF filtering later in the system call entry path
when the arguments are passed directly, but then it's harder to interact
well with ptrace.
>
>> I would prefer if it would work nicely with a ptrace supervisor, because
>> to me it seems that if something can't be resolved in the BPF filter, more
>> context and direct control is needed. The main downside of ptrace for
>> jailing is its overhead (and some quirks). If that can be taken away for
>> most system calls by using BPF then it would be useful for my use case.
>
> I could not agree more. I have a patch already in existence that adds
> a call to tracehook_syscall_entry on failure under certain conditions,
> but I didn't want to bog down discussion of the core feature with that
> discussion too. I think supporting a ptrace supervisor would allow
> for better debugging and sandbox development. (Then I think most of
> the logic could move directly to BPF. E.g., only allow pointer
> arguments for open() to live in this known read-only memory, etc.)
That is very hard to do in practise except for very limited sandboxing
cases. In the general case you want to check all paths, but knowing
beforehand where those are stored is hard when running arbitrary stuff.
And it doesn't guarantee that it are safe path, because they can start
in the middle of a stored path and turn an absolute path into a relative
one.
And updating the filters on the run all the time is a hassle too. So
I think most logic will stay out of BPF, especially because it is the
more tricky stuff to do. But open() is not that performance critical
compared to stuff that happens all the time and where you really don't
want the ptrace overhead, like gettimeofday().
By the way, I think you want the filter to decide with what error code
the system call fails instead of hard coding it to EACCESS. So just use
the return value instead of checking against regs_size, which doesn't
make much sense anyway. Then you also have a way for the filter to tell
whether the system call should be passed on to ptrace or not.
Ideally, the BPF filter should be able to deny the system call with a
specific error code, deny the call and kill the task, have a way to
defer to ptrace, and a way to allow it.
Greetings,
Indan
much simpler/faster than what regset does to create the artificial
user_regs_struct32.
> > Doesn't matter. I think Indan has a better suggestion.
>
> I disagree, but perhaps I'm not fully understanding!
I have much more chances to be wrong ;) I leave it to you and Indan.
Oleg.
True, I could collapse pt_regs to looks like the exported ABI pt_regs.
Then only compat processes would get the copy overhead. That could
be tidy and not break ABI. It would mean that I have to assume that
if unsigned long == 64-bit and is_compat_task(), then the task is
32-bit. Do you think if we ever add a crazy 128-bit "supercomputer"
arch that we will add a is_compat64_task() so that I could properly
collapse? :)
I like this idea!
>> > Doesn't matter. I think Indan has a better suggestion.
>>
>> I disagree, but perhaps I'm not fully understanding!
>
> I have much more chances to be wrong ;) I leave it to you and Indan.
We're being very verbose. I hope we can come to a good place! I took
a break from my response to reply here :)
thanks!
will
FWIW, it's possible for a task to execute in 32-bit mode when
!is_compat_task or in 64-bit mode when is_compat_task. From earlier
in the thread, I think you were planning to block the wrong-bitness
syscall entries, but it's worth double-checking that you don't open up
a hole when a compat task issues the 64-bit syscall instruction.
(is_compat_task says whether the executable was marked as 32-bit. The
actual execution mode is determined by the cs register, which the user
can control. See the user_64bit_mode function in
arch/asm/x86/ptrace.h. But maybe it would make more sense to have a
separate 32-bit and 64-bit BPF program and select which one to use
based on the entry point.)
--Andy
Yup - I had to (see below).
> (is_compat_task says whether the executable was marked as 32-bit. �The
> actual execution mode is determined by the cs register, which the user
> can control. �See the user_64bit_mode function in
> arch/asm/x86/ptrace.h. �But maybe it would make more sense to have a
> separate 32-bit and 64-bit BPF program and select which one to use
> based on the entry point.)
So that was my original design, but the problem was with how regviews
decides on the user_regs_struct. It decides using TIF_IA32 while I
can only check the cross-arch is_compat_task() which checks TS_COMPAT
on x86. If I'm just collapsing registers for compat calls (which I am
exploring the viability of right now), then I guess I could re-fork
the filtering to support compat versus non-compat. The nastier bits
there were that I don't want to allow a compat call to be allowed
because a process only defined non-compat. I think that can be made
manage-able though.
I'll finish proving out the possibilities here.
Thanks!
will
Confused... Afaics, TIF_IA32 says that the binary is 32-bit (this comes
along with TS_COMPAT).
TS_COMPAT says that, say, the task did "int 80" to enters the kernel.
64-bit or not, we should treat is as 32-bit in this case.
No?
Oleg.
What happens if unsigned long is no longer 64-bit in some distant future?
As to endianness, fixed endianess means that userland programs that
have a different endianness will need to translate their values. It's
just shifting the work around.
>>> So considering that you seem to be stuck with running 32 bits BPF
>>> filters anyway, you can as well make the input always 32 bits or
>>> always 64 bits. 32 bits would lose information sometimes, so making
>>> it always 64 bits without sign extension seems logical.
>>
>> This is not fully formed logic. BPF can operate on values that exceed
>> its "bus" width. Just because it can't do a load/jump on a 64-bit
>> value doesn't mean you can't implement a jump based on a 64-bit value.
>> You just split it up. Load the high order 32-bits, jump on failure,
>> fall through and load the next 32-bits and do the rest of the
>
> Yes, hence my proposal to just bite the bullet and provide a fixed,
> cross-platform 64 bit argument interface.
>
>> comparison. Extending Eric's python translator wouldn't be
>> particularly painful and then it would be transparent to an end user.
>
> Your end user uses your ABI directly, Eric's python translator is only
> one of them.
Sure, but how many people write BPF manually today versus those who
use ethereal/wireshark/tcpdump/libpcap? And for network data, the
data protocols may vary per packet.
>>> This would
>>> uglify the actual BPF filters immensely though, because then they
>>> often have to check the upper argument half too, usually just to
>>> make sure it's zero. They can't be sure that the kernel will ignore
>>> the upper half for 'int' arguments.
>>
>> Of course not! This will uglify the filters until someday when BPF
>> grows 64-bit support (or never), but it's not that bad. The BPF
>> doesn't need to be pretty, just effective. And it could be made even
>> easier with JIT support enabled since it could provide better native
>> code behaviors.
>
> What JIT? If there is one I doubt it's smart enough to consolidate two
> 32 bit operations into one 64 bit operation.
There's a BPF JITer in the kernel now. I doubt it does consolidation
either, but with a lightweight lookahead, it would be possible to
collapse known patterns for checking 64-bit values (not in all cases,
but in generic ones).
>>> So I suppose the question is: How do you expect people to use this on
>>> 64 bit systems?
>>
>> As mentioned above. The whole point of using BPF and user_regs_struct
>> is to implement _less_ kernel magic instead of more.
>
> At the cost of making it cross-platform and harder to use. I think it is
> a bit sad that the code still ends up being so platform dependent while
> it is running in a virtual machine.
Not all virtual machines have the same goal. The goal of BPF is to
provide a safe place to evaluate user-supplied instructions over a
fixed window of data to determine acceptance. While BPF for socket
data is arch-independent, each BPF program needs to understand what
the packet data is being operated on. In this case, the
user_regs_struct is the packet data and the BPF program needs to be
tailored to it.
> And you still have to fix up the compat case.
Not really. I lock down the compat case. _Even_ with fixed 64-bit
arguments, you still get system call number mismatches which mean you
need to keep independent filters for the same task. I had this in one
of my first implementations and it adds a nasty amount of implicit
logic during evaluation.
> [...]
>>>> If we do a new ABI, not only does that
>>>> have to be exported to userland, but it means we're still copying and
>>>> munging the data around (which was why I was trying to see if pt_regs
>>>> was a easier way to get a speed boost).
>>>
>>> The difference is that the register ABI uses the messy ptrace ABI which
>>> is a bit strange and not cross-platform, while simply exporting system
>>> call arguments is plain simple and what everyone tries to get out of the
>>> pt_regs anyway.
>>
>> user_regs_struct != pt_regs. user_regs_struct is acquired using
>> regviews which is already provided by each arch for use by coredump
>> generation and PTRACE_[GS]ETREGS. There is no messy ptrace ABI.
>
> How is exporting registers via a structure not messy? And if PEEKUSR
> uses a different ABI then ptrace's ABI is very messy. And it gets messy
> whenever you cross a 32/64 bit boundary.
I'm not following.
>> Also, the whole point of the patch series was to create filters that
>> were not cross-platform. I don't believe that is the kernel's job.
>
> It's the kernel's job to provide an abstracted view of the hardware so
> user space has a consistent view. Not trying to make it cross-platform
> is just slacking.
I disagree. I believe it is overengineering and poor design to create
an brand new interface with new semantics to hide an ABI that userland
is already aware of.
>> system calls are inherently arch specific so why go to all the effort
>> to hide that?
>
> Because, although the numbers are certainly arch specific, the system calls
> themselves including the argument ordering are surprisingly consistent.
>
> The numbers are handled by the SYS_* defines, so when porting to a different
> arch people just have to check if the arguments they use still match and
> that's it. If you provide registers they have to put more effort into
> porting the code.
Yup the __NR_* defines are ABI and so is the argument ordering
(clearly). Of course, userspace knows that the argument ordering is
based on registers since that's how it preps the registers before
calling the arch-specific system call trap. This is why I believe it
makes sense to just store the arch reg mappings in a userland library
rather than do it kernel-side.
>>> But considering system call numbers are platform dependent anyway, it
>>> doesn't really matter that much. I think an array of syscall arguments
>>> would be a cleaner interface, but struct pt_regs would be acceptable.
>>
>> As I said, user_regs_struct is the portable pt_regs, so I don't see
>> why it's a problem. But, using syscall_get_arguments is doable too if
>> that's the route this goes.
>
> It's not portable because it is different for every arch.
I was describing the kernel code, not the data set. By using
regviews, I get the consistent register view for the personality of
the process for the architecture it is running on. This means that
the user_regs_struct will always be consistent _for the architecture_
when given to the user's BPF code. It does not create a portable
userland ABI but instead uses the existing ABI in a way that is
arch-agnostic in the kernel (using the regviews interfaces for arch
fixup).
>>>> If there's consensus, I'll change it (and use syscall_get_arguments),
>>>> but I don't believe it makes sense. (more below)
>>>
>>> It's about system call filtering, I'd argue that giving anything else than
>>> the arguments doesn't make sense. Yes, the registers are the current system
>>> call ABI and reusing that is in a way simpler, but that ABI is about telling
>>> the kernel what the arguments are, it's not the best way for the kernel to
>>> tell userspace what the arguments are because for userspace it ends up in
>>> some structure with its own ABI instead of in the actual registers.
>>
>> I don't see this disconnect. The ABI is as expected by userspace.
>> Giving an array of arguments and a system call number will work fine,
>> but it is not a known ABI. We can create a new one, but I don't
>> believe this argument justifies it. It's not what the kernel is
>> telling user space, it's what is safe to evaluate in the kernel using
>> what userspace knows without adding another new interface. If a new
>> interface is what it takes to get this merged, then I'll clearly do
>> it, but I'm still not sold that it is actually better.
>
> You are adding a new interface anyway with this feature. Using user_regs_struct
> is not something expected by userspace, except if it are hardcore ptrace users.
And anything that parses coredumps.
> And they will get the offsets wrong if they are on 64 bits because those are
> different than for ptrace. The ptrace ABI uses longs, BPF is fixed to 32 bits,
> it's just not a good fit.
That's not true on x86-64:
http://lxr.linux.no/linux+v3.2.1/arch/x86/kernel/ptrace.c#L945
PEEKUSR uses the offsetof() the 32-bit register struct for compat
calls and, with that macro, maps it to the proper entry in pt_regs.
For non-compat, it just uses the offset into pt_regs:
http://lxr.linux.no/linux+v3.2.1/arch/x86/kernel/ptrace.c#L475
lxr.linux.no/linux+v3.2.1/arch/x86/kernel/ptrace.c#L170
As there is significant overlap in the contents of user_regs_struct
and pt_regs on the platform. While it's possible for another arch to
use a different set of ptrace ABI offsets, using offsetof(struct
user_regs_struct, ...) will always work. It is not long-based.
If you want to convince me this isn't a good fit, I need you to meet
me halfway and make sure your assertions match the code! :)
> Put another way, why isn't user_regs_struct passed on to each system call
> implementation in the kernel instead of the arguments? It's exactly the
> same reason as why passing arguments is better for BPF too.
This is silly. When you call a system call, what do you do? You
prepare the register state such that the arguments are in the right
positioning. user_regs_struct is the snapshot of the registers the
program-itself prepared.
If you think it is more pleasing to have [syscall_nr|args0|...|args6],
then I can respect that. But so far, the technical arguments have not
backed up that direction.
>>>>> As everyone dealing with system call is mostly interested in the same things:
>>>>> The syscall number and the arguments. You can add some other potentially useful
>>>>> info like instruction pointer as well, but keep it limited to cross-platform
>>>>> things with a clear meaning that make sense for system call filtering.
>>>>
>>>> Well there are also clone/fork, sig_rt_return, etc related registers
>>>> too.
>>>
>>> What special clone/fork registers are you talking about?
>>
>> On x86, si is used to indicate the tls area:
>> http://lxr.linux.no/linux+v3.2.1/arch/x86/kernel/process_32.c#L238
>> (and r8 on 64-bit). Also segment registers, etc.
>
> si is just the 4th (5th for x86_64) system call argument, it's nothing special.
True - arguably though fork() takes no arguments. Now
syscall_get_arguments() won't know that, so si/r8 and all will still
be exposed for filtering. For some reason, I thought some other
pieces were used (EFLAGS?), but I can't back that up in the source.
> System calls never use segment registers directly, do they? It's not part
> of the system call ABI, so why would you want to use them in filtering?
On x86-32, you may be using segmentation for isolation purposes. If
you do so, it may be of interest where the call comes from. Nothing
truly relevant, just another possibility. *shrug*
>>> I don't think anyone would want to ever filter sig_rt_return, you can as
>>> well kill the process.
>>
>> Why make that decision for them?
>
> I don't, I'm just saying it doesn't make sense to filter it. Based on what
> would anyone ever want to filter it? It's just a kernel helper thing to
> implement signal handlers.
What if you want the process to die if it returns from any signals?
> But now you mention it, it won't be bad to enforce this in the kernel,
> otherwise everyone has to add code to allow it. Same for exit(_group).
> Because if those are denied, there's nothing else to run instead.
The process will be do_exit()d. I don't know why it matters?
Yup - it's nice to have that.
>> That said, I can imagine filtering on other registers as being useful
>> for tentative research.
>
> They can use ptrace for that.
And it will stay research forever.
>> Think of the past work where control flow
>> integrity was done by XORing the system call number with a run-time
>> selected value. Instead of doing that, you could populate a
>> non-argument register with the xor of the syscall number and the
>> secret (picked and then added to the BPF program before install).
>
> What non-argument register would you like to use on x86? I think all
> are used up already. All you got left is the segment registers, and
> using those seems a bad idea.
There are other arches where this would be feasible.
> It also seems a bad idea to promote non-portable BPF filtering programs.
Why? If it's possible to make a userland abstraction layer, why do we
force the kernel to take on more work?
> If you support modifying arguments and syscall nr then people can keep
> doing the XORing trick with BPF. Another advantage of allowing that is
> that unsafe old system calls can be replaced with secure ones on the
> fly transparently.
>
> Really, disallowing modifications is much more limiting than not providing
> all registers. But allowing modifications is a lot harder to get right
> with a register interface.
I'm not going to make the change to support BPF making the data
mutable or using scratch space as an output vector. If that is
something that makes sense to the networking stack, then we could
benefit from it, but I don't want to go there.
>> I'm not saying this is a good idea, but it seems silly to exclude it
>> when there doesn't seem to be any specific gain and only added
>> kernel-side complexity. It may also be useful to know what other
>> saved registers (segment, etc) depending on what sort of sandboxing is
>> being done. The PC/IP is fun for that one since you could limit all
>> syscalls to only come from the vdso or vsyscall locations.
>
> Problem is that that is less useful than it seems because malicious code
> can always just jump to a syscall entry instruction. Randomization helps
> a bit, but it gives no guarantees. Better to store an XORed secret in the
> syscall nr and arguments, that gives up to 224 bits of security.
Yup, but rewriting the system call number or arguments would require a
ptrace supervisor without changing the nature of BPF. Also, if
something like this were crafted, a XOR-guess failure would result in
immediate process termination. This allows for lower entropy to still
provide a robust mechanism.
>>> BTW, the width of the fields depends on how you want to resolve
>>> the 64 bit issue. As BPF is always 32 bits, it doesn't make much
>>> sense to use longs. And as offsets are used anyway, it probably
>>> makes more sense to define those instead of a structure.
>>
>> Yup. I'm still not sold on needing a standalone ABI for this when it
>> is some combination of syscall_get_arguments and KSTK_EIP, since
>> user_regs_struct already handles the right type widths, etc. In fact,
>> it gets a bit more challenging.
>
> I would go for system call number + arguments only, and forget about the
> EIP and stack, except if people really want it. But if you do add it then
> it's barely any less limiting than a register view.
What do you think of Oleg's proposal? I like it a lot! If I can just
use pt_regs for all but compat tasks, then it means there _no_ copy
needed to evaluate the BPF. This saves on more than just the
user_regs_struct registers but a lot more. It also avoids
syscall_get_arguments, etc. I'm looking into how to best implement
this, but I think it may be a real option.
It does mean BPF is per-arch, per syscall-convention, but you know I
am fine with that :) I do think the performance gains could be well
worth avoiding any copying. (Perf gains are the strongest argument I
think for your proposal and the thing that would likely lead me to do
it.)
>> If you look at syscall_get_arguments for x86, it always uses unsigned
>> long even when it is a TS_COMPAT task:
>> lxr.linux.no/linux+v3.2.1/arch/x86/include/asm/syscall.h#L97
>> That means that the BPF exposed types would either always need to be
>> unsigned long, irrespective of is_compat_task, or seccomp filter would
>> need additional per-arch fixups (which is what regviews already do :/
>> ).
>
> If compat tasks are involved you are screwed anyway and have to fiddle
> with data, it's unavoidable.
Well I was letting the existing code do that for me.
> Arguments exposed to BPF should always be 64 bits even on 32 bit archs,
> that solves all compat and portability problems.
Not really. It means that if there is ever a 128-bit register arch, a
new ABI would be spawned for it. But I don't know what else would be
broken in the kernel by that, so it's hard to tell if that argument
makes sense.
> I really don't see the problem of copying 6 arguments to a fixed place.
I was indicating the need to truncate them for compat.
> If that is tricky then you're either trying to use the wrong function
> or doing it at the wrong place in the kernel. I'd expect that passing on
> the arguments is highly optimised in the kernel, all system calls have
> easy access to them, why would it be hard for the BPF code to get it?
See the source. They are static inlines but there are still memory
copies. I would then need to copy a second time to truncate them to
the correct width (or some other fanciness).
> If you use syscall_get_arguments you have to call it once for each arg
> instead of calling it once and trying to fix up the 32/64 bit and
> endianness afterwards.
You - or call it once and then iterate over the emitted array doing fixups.
> So call it once and store the value in a long. Then copy the low half
> to the right place and then the upper half when on 64 bits. It may not
> look too pretty, but the compiler should be able to optimise almost all
> overhead away and end up with 6 (or 12) int copies. Something like this:
>
> struct bpf_data {
> uint32 syscall_nr;
> uint32 arg_low[MAX_SC_ARGS];
> uint32 arg_high[MAX_SC_ARGS];
> };
>
> void fill_bpf_data(struct task_struct *t, struct pt_regs *r, struct bpf_data *d)
> {
> int i;
> unsigned long arg;
>
> d->syscall_nr = syscall_get_nr(t, r);
> for (i = 0; i < MAX_SC_ARGS; ++i){
> syscall_get_arguments(t, r, i, 1, &arg);
> d->arg_low[i] = arg;
> d->arg_high[i] = arg >> 32;
> }
> }
Sure, but it seems weird to keep the arguments high and low not
adjacent, but I realize you want an arch-independent interface. I
guess in that world, you could do this:
{
int32_t nr;
uint32_t arg32[MAX_SC_ARGS];
uint32_t arg64[MAX_SC_ARGS];
/* room for future expansion is here */
};
If no one else sees the benefit of keeping this out of the kernel,
then I can do this. I hope Oleg's idea works though because I think ti
addresses the performance problems even if it means the userspace
tooling work is higher to achieve good portability.
>>
>> Agreed -- except that, as I mentioned above, there are still
>> significant complexities kernel-side if anything other than regviews
>> are used.
>
> I'm missing what those complexities are.
>
>>
>>>>> This way the registers have to be interpreted only once by the kernel and all
>>>>> filtering programs don't have to do that mapping themselves. It also avoids
>>>>> doing unnecessary work fiddling/translating registers like the ptrace ABI does.
>>>>
>>>> The kernel does only interpret them once (after entry to
>>>> __secure_computing).
>>>
>>> Not if data shuffling is needed for compat related stuff.
>>
>> I agree! user_regs_struct get rid of the data shuffling. pt_regs and
>> syscall_get_arguments all seem to induced data shuffling for compat
>> junk. I just wish pt_regs was compat-width appropriate, but it makes
>> sense that a 64-bit kernel with a 32-bit program would use 64-bit
>> registers on its side. Just frustrating.
>
> Are user_regs_struct entries 32-bit for 32-bit tasks or is it 64-bit if
> the kernel is 64-bit? If they're 64-bit then you didn't get rid of the
> data shuffling.
They are appropriate to the process personality as I said and linked
to in the ptrace code. Take a look at my patch - no data shuffling is
needed.
>>>> It gets the regview and has it populate a
>>>> user_regs_struct. �All the register info is per-arch and matches
>>>> PTRACE_GETREGS, but not PTRACE_PEEKUSR.
>>>
>>> GETREGS seems to be a subset of PEEKUSR. That is, both start with
>>> a struct pt_regs/user_regs_struct (seems to be the same thing?)
>>
>>
>> Not quite. on x86-32, pt_regs and user_regs_struct are identical.
>> Power PC as well, I think. They diverge on pretty much every other
>> platform. Also, x86 compat has some trickiness. pt_regs is 64-bit on
>> x86-64 even with compat processes. Instead what happens is the naming
>> is kept if __KERNEL__ such that there aren't different struct member
>> names in all the syscall.h and ptrace code. The
>> IA32_EMULATION/TS_COMPAT stuff can then just use the reordered member
>> names without even more #ifdef madness.
>
> It was a surprise to me to find out that the pt_regs a 64-bit ptrace user
> gets for a 32 bit tracee differs from the pt_regs when both are 32 bits.
>
>> user_regs_struct will use the correct width according to the process
>> personality. On all arches with is_compat_task support, this matches
>> -- except x86. With x86, you can force a 32-bit syscall entry from a
>> 64-bit process resulting in a temporary setting of TS_COMPAT but with
>> a personality that is still 64-bit. This is an edge case and one I
>> think forcing compat and personality to not-change addresses.
>
> How's that possible? Setting CS to 0x23? Can userspace do that?
int 0x80 will do the trick or setting the CS which I believe can be done.
>>> PEEKUSR only has extra access to debugging registers.
>>
>> GETREGS uses a regview:
>> http://lxr.linux.no/linux+v3.2.1/arch/x86/kernel/ptrace.c#L1153
>> PEEKUSR uses getreg or getreg32 directly (on x86). compat_arch_ptrace
>> on x86 will then grab a specified register based on the 32-bit offsets
>> out of a 64-bit pt_regs and can return any register offset, like
>> ORIG_EAX:
>> lxr.linux.no/linux+v3.2.1/arch/x86/kernel/ptrace.c#L1026
>>
>> It's this magic fixup that allows ptrace to just use pt_regs for
>> PEEKUSR while GETREGS is forced to do the full register copy.
>>
>>> That is another problem of giving a register view: Which registers
>>> are you going to give access to?
>>
>> Always the general set. This is the set denoted by core_note_type
>> NT_PRSTATUS on all architectures as far as I can tell.
>>
>>>> All the weird stuff is in
>>>> PEEKUSR to deal with the fact that compat pt_regs members are not
>>>> actually the same width as userspace would expect.
>>>>
>>>> If we populated an ABI as you've proposed, we'd at least need to
>>>> build that data set and give it syscall_get_arguments() output.
>>>
>>> Yes, but that's all you have to do, nothing more.
>>
>> I do even less now! :)
>
> You have to do more for the compat case.
No. Please look at the code.
>>> The pt_regs a 64 bit kernel builds for a 32 bit compat process is
>>> different than one from a 32 bit kernel, so you have to do some kind
>>> of data shuffling anyway.
>>
>> Yes - that's why I use user_regs_struct.
>
> But there are different versions of user_regs_struct, depending on the
> situation. This implies that the BPF filters have to be different too,
> while they could be exactly the same (except for the syscall nr).
Yes - per-arch. If you're already doing per-arch fixup, why ius
mapping 6 extra registers such a burden? If the code can't do that in
userspace, it is slacking.
>>> Worse, once you pick this ABI you're stuck with it and can't get rid
>>> of compat horrors like you have now with ptrace(). Do you really want
>>> to reuse an obscure ptrace ABI instead of creating a simpler new one?
>>
>> Exactly why I'm using user_regs_struct. I think we've been having
>> some cross-talk, but I'm not sure. The only edge case I can find with
>> user_regs_struct across all platforms is the nasty x86 32-bit entry
>> from 64-bit personality. Perhaps someday we can just nuke that path
>> :) But even so, it is tightly constrained by saving the personality
>> and compat flag in the filter program metadata and checking it at each
>> syscall.
>
> I think it's a good idea to nuke that path, it seems like a security hole
> waiting to happen.
Agreed. And yet we can't (yet?).
That is blatantly untrue. Have you ever used tcpdump's expression
language for filtering packets? Wireshark?
> And the more complicated you make it, the less likely it is that
> anyone will use this.
Unless there is a nice library that makes it work well.
>>>>> I missed if the original version was allowed to change the registers or not,
>>>>> if it is then perhaps the BPF program should set a specific flag after changing
>>>>> anything, to make it more explicit.
>>>>
>>>> Registers are const from the BPF perspective (just like with socket
>>>> filters). � Adding support for tracehook interception later could
>>>> allow for supervisor guided register mutation.
>>>
>>> If the ABI gives access to arguments instead of registers you don't have
>>> to do anything tricky: No security checks, no need for fixing up register
>>> values to their original value after the system call returns or any other
>>> subtleties. BPF filters can just change the values without side effects.
>>
>> BPF programs should never change any filters. BPF does not have the
>> capability to modify the data it is evaluating. Doing that would
>> require a BPF change and alter its very nature, imo.
>
> It could if you make the data part of the scratch memory. If you put the
> data at the top, just after BPF_MEMWORDS, then it's all compatible with
> the read-only version. Except the sk_chk_filter() code. But if you ever
> want to consolidate with the networking version, then you already need
> new non-byteswapping instructions. You can as well add a special modify
> instruction too then. Making it very explicit seems better anyway.
I am not going to go this route right now. If you want to, be my
guest. We can add BPF instructions later, but I am not going down that
rabbit hole now.
> Using BPF for system call filtering changes its very nature already.
No it doesn't. user_regs_struct becomes another data protocol.
> I must say that until your patch came up, I've never heard of BPF filters
> before. I think I'm going to use it in our ptrace jailer for network
> filtering, if it's possible to get the peer address for normal TCP/UDP
> sockets. Documentation is quite vague.
Cool!
>> While arguments seem tidy, we still end up with the nasty compat pain
>> and it is only worse kernel-side since there'd be no arch-independent
>> way to get the correct width system call arguments. I'd need to guess
>> they were 32-bit and downgrade them if compat. That or add a new arch
>> callout. Very fiddly :/
>
> See code above. It seems fairly tidy to me.
>
> You could also do the BPF filtering later in the system call entry path
> when the arguments are passed directly, but then it's harder to interact
> well with ptrace.
Where? Interpose it into each system call? The later I put it, the
less attack surface reduction I get. The whole point of this
framework is to reduce the kernel's attack surface by doing minimal
kernel-side work before making a policy decision.
Also, I've already gone the ftrace route. As I said in the writeup, I
don't think it is the right path for this sort of functionality.
>>
>>> I would prefer if it would work nicely with a ptrace supervisor, because
>>> to me it seems that if something can't be resolved in the BPF filter, more
>>> context and direct control is needed. The main downside of ptrace for
>>> jailing is its overhead (and some quirks). If that can be taken away for
>>> most system calls by using BPF then it would be useful for my use case.
>>
>> I could not agree more. I have a patch already in existence that adds
>> a call to tracehook_syscall_entry on failure under certain conditions,
>> but I didn't want to bog down discussion of the core feature with that
>> discussion too. I think supporting a ptrace supervisor would allow
>> for better debugging and sandbox development. (Then I think most of
>> the logic could move directly to BPF. E.g., only allow pointer
>> arguments for open() to live in this known read-only memory, etc.)
>
> That is very hard to do in practise except for very limited sandboxing
> cases. In the general case you want to check all paths, but knowing
> beforehand where those are stored is hard when running arbitrary stuff.
> And it doesn't guarantee that it are safe path, because they can start
> in the middle of a stored path and turn an absolute path into a relative
> one.
Yeah - you'd need a lookup table in the BPF, etc. It'd be pretty ugly :)
> And updating the filters on the run all the time is a hassle too. So
> I think most logic will stay out of BPF, especially because it is the
> more tricky stuff to do. But open() is not that performance critical
> compared to stuff that happens all the time and where you really don't
> want the ptrace overhead, like gettimeofday().
yeah definitely.
> By the way, I think you want the filter to decide with what error code
> the system call fails instead of hard coding it to EACCESS. So just use
> the return value instead of checking against regs_size, which doesn't
> make much sense anyway. Then you also have a way for the filter to tell
> whether the system call should be passed on to ptrace or not.
EACCES is never passed to userspace. As far as ptrace is concerned, I
don't think the filter needs any awareness of whether there is a
tracer or not.
That said, I'm happy to change the return value semantic. Right now
it matches how it works in the networking stack. It returns the data
size to be accepted.
What would make sense? 0 is success and any other value is a failure.
Then specify that the ABI failure return code is _some value_ and then
populate the other later? I was planning on doing that with
regs_size. reg_size is reserved, but the other return values could be
exported and used if they ever came into existence. I'm open to what
everyone thinks makes the most sense!
> Ideally, the BPF filter should be able to deny the system call with a
> specific error code, deny the call and kill the task, have a way to
> defer to ptrace, and a way to allow it.
Not happening (by my hand :). I'm not changing seccomp to allow it to
cause a system call to fail with an error code. I'll add support for
tracehook integration if this patch can get merged, but I'm not going
to change the basic semantics of seccomp. The nice thing is, if we
reserve return values, this functionality can be layered on later
without it causing any ABI breakage and with proper consideration
independent of whether the basic functionality gets merged. Then, if
you want retool the entire seccomp path on all architectures to allow
graceful system call failure, it'd be totally doable.
Thanks!
will
I think you're right, and checking which entry was used is better than
checking the cs register (since 64-bit code can use int80). That's
what I get for insufficiently careful reading of the assembly. (And
for going from memory from when I wrote the vsyscall emulation code --
that code is entered from a page fault, so the entry point used is
irrelevant.)
--Andy
Ouch, so a few issues:
- pt_regs isn't exported for most arches
- is_compat_task arches would need custom fixups
I think Indan takes this round :) I'll being integrating a
syscall_get_arguments approach. Hopefully it can be quite efficient.
cheers!
If this turns out to be expensive, it might be possible to break it up
and load the arguments on demand (and cache them); i.e. have
load_pointer() or similar notice when it is about to access something
other than bpf_data.syscall_nr.
-Kees
--
Kees Cook
ChromeOS Security
Makes perfect sense! In theory (as a few other people pointed this
out off list), it is entirely possible to never populate any data for
load_pointer except an optional cache. Just provide a custom
load_pointer that knows to take the offset return the syscall nr or
the args or some slice of the returned data.
This is even easier if the struct looks like:
struct {
int nr;
union {
uint32_t args32[6];
uint64_t args64[6];
}
};
since you can just use the offset without doing any endian-based
splitting. Another suggestion (thanks roland!) was to add
int syscall_arch;
to the struct populated with the AUDIT_ARCH_* defines. This would
help the case Indan was worried about -- portable filter programs.
It looks like there'd be some cross-arch plumbing to make the
AUDIT_ARCH_ data available, but not too bad.
Seem sane? I'm headed down this path now and I think it'll work out
assuming there aren't major objections to the syscall_arch piece.
thanks!
will
Hrm. I'm still not so sure about the arch bit. Without it, BPF
programs aren't directly share-able, but they could be as long as the
values for k and syscall numbers are being adapted. By putting arch
in the program, it makes it more likely that every system call will
have a bpf preamble that has to check the syscall_arch. It could
easily add 100s of nanoseconds to every call (on slower arches).
I'll probably do the next patch series without arch-checking support
then I can add if it is seems needed. Nothing forces a filter program
to check it, so it could be that we let the author make the decision.
cheers!
Wait: If a tasks is set to 64 bit mode, but calls into the kernel via
int 0x80 it's changed to 32 bit mode for that system call and back to
64 bit mode when the system call is finished!?
Our ptrace jailer is checking cs to figure out if a task is a compat task
or not, if the kernel can change that behind our back it means our jailer
isn't secure for x86_64 with compat enabled. Or is cs changed before the
ptrace stuff and ptrace sees the "right" cs value? If not, we have to add
an expensive PTRACE_PEEKTEXT to check if it's an int 0x80 or not. Or is
there another way?
I think this behaviour is so unexpected that it can only cause security
problems in the long run. Is anyone counting on this? Where is this
behaviour documented?
Greetings,
Indan
I don't know what your ptrace jailer does. But a task can switch
itself between 32-bit and 64-bit execution at will, and there's
nothing the kernel can do about it. (That isn't quite true -- in
theory the kernel could fiddle with the GDT, but that would be
expensive and wouldn't work on Xen.)
That being said, is_compat_task is apparently a good indication of
whether the current *syscall* entry is a 64-bit syscall or a 32-bit
syscall. Perhaps the function should be renamed to in_compat_syscall,
because that's what it does.
>
> I think this behaviour is so unexpected that it can only cause security
> problems in the long run. Is anyone counting on this? Where is this
> behaviour documented?
Nowhere, I think.
--Andy
Well, saying it like that suggests that there is more of a "mode change"
than really exists. It's simply that any task can use int $0x80 and
this always means using the 32-bit syscall table with TS_COMPAT set.
> Our ptrace jailer is checking cs to figure out if a task is a compat task
> or not, if the kernel can change that behind our back it means our jailer
> isn't secure for x86_64 with compat enabled. Or is cs changed before the
> ptrace stuff and ptrace sees the "right" cs value? If not, we have to add
> an expensive PTRACE_PEEKTEXT to check if it's an int 0x80 or not. Or is
> there another way?
I don't think there's another way. hpa and I once discussed adding a field
to the extractable "register state" that would say which method the syscall
in progress had taken to enter the kernel. That would tell you which
flavor of syscall instruction was used (or none, i.e. a trap/interrupt).
But nobody ever had a real need for it, and we didn't pursue it further.
(We originally talked about it in the context of distinguishing whether a
32-bit task had used sysenter or syscall or int $0x80, I think.)
> I think this behaviour is so unexpected that it can only cause security
> problems in the long run. Is anyone counting on this? Where is this
> behaviour documented?
It's documented the same place the entire Linux machine-level ABI is
documented, which is nowhere. Someone somewhere may once have been
counting on it. (The story I heard was about an implementation of valgrind
for 32-bit code that ran in 64-bit tasks, but I don't know for sure that it
was really done.) The general rule is that if it ever worked before in a
coherent way, we don't break binary compatibility.
In the implementation, it would require a special check to make it barf.
It's really just something that falls out of how the hardware and the
kernel implementation works. I suppose you could add such a check under a
new kconfig option that's marked as being potentially incompatible with
some old applications. Good luck with that.
True, the kernel always runs in 64-bit mode, it just selects which path
is taken.
>> Our ptrace jailer is checking cs to figure out if a task is a compat task
>> or not, if the kernel can change that behind our back it means our jailer
>> isn't secure for x86_64 with compat enabled. Or is cs changed before the
>> ptrace stuff and ptrace sees the "right" cs value? If not, we have to add
>> an expensive PTRACE_PEEKTEXT to check if it's an int 0x80 or not. Or is
>> there another way?
>
> I don't think there's another way. hpa and I once discussed adding a field
> to the extractable "register state" that would say which method the syscall
> in progress had taken to enter the kernel. That would tell you which
> flavor of syscall instruction was used (or none, i.e. a trap/interrupt).
> But nobody ever had a real need for it, and we didn't pursue it further.
> (We originally talked about it in the context of distinguishing whether a
> 32-bit task had used sysenter or syscall or int $0x80, I think.)
Argh. So strace and all other ptrace users will think the task is calling a
different system call than it executes, except if they check for int 0x80,
which I bet they don't.
I suppose I could cache the checked EIP-2's results, but then I also have to
check if the memory is read-only and invalide the cache when the mapping may
be changed. Probably not worth the complexity.
>> I think this behaviour is so unexpected that it can only cause security
>> problems in the long run. Is anyone counting on this? Where is this
>> behaviour documented?
>
> It's documented the same place the entire Linux machine-level ABI is
> documented, which is nowhere.
AMD wrote the "System V Application Binary Interface" which decribes
some Linux conventions. It's better than nothing. But it just mentions
'syscall', not what happens when int 0x80 is called anyway.
> Someone somewhere may once have been
> counting on it. (The story I heard was about an implementation of valgrind
> for 32-bit code that ran in 64-bit tasks, but I don't know for sure that it
> was really done.) The general rule is that if it ever worked before in a
> coherent way, we don't break binary compatibility.
Well, considering the code can't be sure if the kernel supports compat mode
at all, I think this case is getting even more obscure than it already is.
Disallowing it won't change the kernel behaviour compared to a kernel with
compat disabled.
What about disallowing this path when the task is being ptraced?
> In the implementation, it would require a special check to make it barf.
> It's really just something that falls out of how the hardware and the
> kernel implementation works. I suppose you could add such a check under a
> new kconfig option that's marked as being potentially incompatible with
> some old applications. Good luck with that.
That seems a hopeless path to follow, and won't solve my problem because
my code has to be able to run on all kernels. Half the point of using
ptrace for jailing was that it's mostly portable with no special kernel
support.
Greetings,
Indan
PTRACE_PEEKTEXT won't securely tell you if it's int 0x80 if there's
another thread modifying the code, or changing the mappings, or it's
executing from a file or shared memory that someone's writing to.
> I think this behaviour is so unexpected that it can only cause security
> problems in the long run. Is anyone counting on this? Where is this
> behaviour documented?
It's a surprise to me too. And like you I'm using ptrace, to trace
what a process touches, not restrict it, but it's subject to the same problem.
This looks like it needs a kernel patch.
-- Jamie
Every user program change it behind your back.
Your ptrace jailer isn't.
> I think this behaviour is so unexpected that it can only cause security
> problems in the long run. Is anyone counting on this? Where is this
> behaviour documented?
Look up far jumps in any x86 manual.
-Andi
--
a...@linux.intel.com -- Speaking for myself only.
> Every user program change it behind your back.
>
> Your ptrace jailer isn't.
I'm sorry but I can't read the above two lines without hearing Yoda's
voice. "Hmm hmm"
-- Steve
I'm pretty sure this isn't about changing cs or far jumps
I think Indan means code is running with 64-bit cs, but the kernel
treats int $0x80 as a 32-bit syscall and sysenter as a 64-bit syscall,
and there's no way for the ptracer to know which syscall the kernel
will perform, even by looking at all registers. It looks like a hole
in ptrace which could be fixed.
-- Jamie
He's assuming that code can only run on two code segments and
not arbitarily switch between them which is a completely incorrect
assumption.
> I think Indan means code is running with 64-bit cs, but the kernel
> treats int $0x80 as a 32-bit syscall and sysenter as a 64-bit syscall,
> and there's no way for the ptracer to know which syscall the kernel
> will perform, even by looking at all registers. It looks like a hole
> in ptrace which could be fixed.
Possibly, but anything that bases its security on ptrace is typically
unfixable racy (just think what happens with multiple threads
and syscall arguments), so it's unlikely to do any good.
-Andi
--
a...@linux.intel.com -- Speaking for myself only.
I think all he needs is to figure out which type of syscall was just
intercepted. (Obviously arguments in memory are a problem.)
We could possibly munge the "orig_ax" field to be different for the
int80 vs syscall cases. That's really the only field that isn't direct
x86 state. And it's 64 bits wide, but we really only care about the
low 32 bits in the kernel. So a bit in the high bits that says "this
was a int80 entry" would be possible.
Linus
That would be incompatible. However you could just add another virtual
register with such information (in fact I thought about that
when I did the compat code originally). However I don't think it'll salvage
the original broken by design ptrace jailer. And everyone else
so far has done fine without it.
-Andi
No it wouldn't.
We'd only do it for the case that everybody gets wrong: int80 from a
64-bit context.
All the other cases are trivial to see (look at CS to determine 32-bit
vs 64-bit system call) and are the common case.
So the one new "incompatible" bit case would be the case that existing
users would inevitably get wrong, so it can hardly be "incompatible".
Linus