Groups keyboard shortcuts have been updated
Dismiss
See shortcuts

[PATCH] tracing: Add task_prctl_unknown tracepoint

5 views
Skip to first unread message

Marco Elver

unread,
Nov 5, 2024, 8:36:28 AM11/5/24
to el...@google.com, Steven Rostedt, Kees Cook, Masami Hiramatsu, Mathieu Desnoyers, Andrew Morton, Oleg Nesterov, linux-...@vger.kernel.org, linux-tra...@vger.kernel.org, Dmitry Vyukov, kasa...@googlegroups.com
prctl() is a complex syscall which multiplexes its functionality based
on a large set of PR_* options. Currently we count 64 such options. The
return value of unknown options is -EINVAL, and doesn't distinguish from
known options that were passed invalid args that also return -EINVAL.

To understand if programs are attempting to use prctl() options not yet
available on the running kernel, provide the task_prctl_unknown
tracepoint.

Note, this tracepoint is in an unlikely cold path, and would therefore
be suitable for continuous monitoring (e.g. via perf_event_open).

While the above is likely the simplest usecase, additionally this
tracepoint can help unlock some testing scenarios (where probing
sys_enter or sys_exit causes undesirable performance overheads):

a. unprivileged triggering of a test module: test modules may register a
probe to be called back on task_prctl_unknown, and pick a very large
unknown prctl() option upon which they perform a test function for an
unprivileged user;

b. unprivileged triggering of an eBPF program function: similar
as idea (a).

Example trace_pipe output:

<...>-366 [004] ..... 146.439400: task_prctl_unknown: pid=366 comm=a.out option=1234 arg2=101 arg3=102 arg4=103 arg5=104

Signed-off-by: Marco Elver <el...@google.com>
---
include/trace/events/task.h | 43 +++++++++++++++++++++++++++++++++++++
kernel/sys.c | 3 +++
2 files changed, 46 insertions(+)

diff --git a/include/trace/events/task.h b/include/trace/events/task.h
index 47b527464d1a..ab711e581094 100644
--- a/include/trace/events/task.h
+++ b/include/trace/events/task.h
@@ -56,6 +56,49 @@ TRACE_EVENT(task_rename,
__entry->newcomm, __entry->oom_score_adj)
);

+/**
+ * task_prctl_unknown - called on unknown prctl() option
+ * @task: pointer to the current task
+ * @option: option passed
+ * @arg2: arg2 passed
+ * @arg3: arg3 passed
+ * @arg4: arg4 passed
+ * @arg5: arg5 passed
+ *
+ * Called on an unknown prctl() option.
+ */
+TRACE_EVENT(task_prctl_unknown,
+
+ TP_PROTO(struct task_struct *task, int option, unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5),
+
+ TP_ARGS(task, option, arg2, arg3, arg4, arg5),
+
+ TP_STRUCT__entry(
+ __field( pid_t, pid )
+ __string( comm, task->comm )
+ __field( int, option)
+ __field( unsigned long, arg2)
+ __field( unsigned long, arg3)
+ __field( unsigned long, arg4)
+ __field( unsigned long, arg5)
+ ),
+
+ TP_fast_assign(
+ __entry->pid = task->pid;
+ __assign_str(comm);
+ __entry->option = option;
+ __entry->arg2 = arg2;
+ __entry->arg3 = arg3;
+ __entry->arg4 = arg4;
+ __entry->arg5 = arg5;
+ ),
+
+ TP_printk("pid=%d comm=%s option=%d arg2=%ld arg3=%ld arg4=%ld arg5=%ld",
+ __entry->pid, __get_str(comm), __entry->option,
+ __entry->arg2, __entry->arg3, __entry->arg4, __entry->arg5)
+);
+
#endif

/* This part must be outside protection */
diff --git a/kernel/sys.c b/kernel/sys.c
index 4da31f28fda8..dd0a71b68558 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -75,6 +75,8 @@
#include <asm/io.h>
#include <asm/unistd.h>

+#include <trace/events/task.h>
+
#include "uid16.h"

#ifndef SET_UNALIGN_CTL
@@ -2785,6 +2787,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
error = RISCV_SET_ICACHE_FLUSH_CTX(arg2, arg3);
break;
default:
+ trace_task_prctl_unknown(me, option, arg2, arg3, arg4, arg5);
error = -EINVAL;
break;
}
--
2.47.0.199.ga7371fff76-goog

Steven Rostedt

unread,
Nov 5, 2024, 11:31:14 AM11/5/24
to Marco Elver, Kees Cook, Masami Hiramatsu, Mathieu Desnoyers, Andrew Morton, Oleg Nesterov, linux-...@vger.kernel.org, linux-tra...@vger.kernel.org, Dmitry Vyukov, kasa...@googlegroups.com
On Tue, 5 Nov 2024 14:34:05 +0100
Marco Elver <el...@google.com> wrote:

> prctl() is a complex syscall which multiplexes its functionality based
> on a large set of PR_* options. Currently we count 64 such options. The
> return value of unknown options is -EINVAL, and doesn't distinguish from
> known options that were passed invalid args that also return -EINVAL.
>
> To understand if programs are attempting to use prctl() options not yet
> available on the running kernel, provide the task_prctl_unknown
> tracepoint.
>
> Note, this tracepoint is in an unlikely cold path, and would therefore
> be suitable for continuous monitoring (e.g. via perf_event_open).
>
> While the above is likely the simplest usecase, additionally this
> tracepoint can help unlock some testing scenarios (where probing
> sys_enter or sys_exit causes undesirable performance overheads):
>
> a. unprivileged triggering of a test module: test modules may register a
> probe to be called back on task_prctl_unknown, and pick a very large
> unknown prctl() option upon which they perform a test function for an
> unprivileged user;
>
> b. unprivileged triggering of an eBPF program function: similar
> as idea (a).
>
> Example trace_pipe output:
>
> <...>-366 [004] ..... 146.439400: task_prctl_unknown: pid=366 comm=a.out option=1234 arg2=101 arg3=102 arg4=103 arg5=104

^^^ ^^^
Why record the pid that is already recorded by the event header?

> + __string( comm, task->comm )

I'm also surprised that the comm didn't show in the trace_pipe. I've
updated the code so that it should usually find it. But saving it here may
not be a big deal.

-- Steve

Marco Elver

unread,
Nov 5, 2024, 11:54:32 AM11/5/24
to Steven Rostedt, Kees Cook, Masami Hiramatsu, Mathieu Desnoyers, Andrew Morton, Oleg Nesterov, linux-...@vger.kernel.org, linux-tra...@vger.kernel.org, Dmitry Vyukov, kasa...@googlegroups.com
To keep in style with the other "task" tracepoints above. I can
certainly do without - it does seem unnecessary.

To cleanup, do we want to remove "pid=" from the other tracepoints in
this file as well (in another patch). Or does this potentially break
existing users?

> > + __string( comm, task->comm )
>
> I'm also surprised that the comm didn't show in the trace_pipe.

Any config options or tweaks needed to get it to show more reliably?

> I've
> updated the code so that it should usually find it. But saving it here may
> not be a big deal.

Thanks,
-- Marco

Steven Rostedt

unread,
Nov 5, 2024, 12:02:50 PM11/5/24
to Marco Elver, Kees Cook, Masami Hiramatsu, Mathieu Desnoyers, Andrew Morton, Oleg Nesterov, linux-...@vger.kernel.org, linux-tra...@vger.kernel.org, Dmitry Vyukov, kasa...@googlegroups.com
Hmm, new_task, pid is different than the creator. But rename is pointless
to record pid. I would get rid of it here, especially since it also creates
a hole in the event (three int fields followed by a long).

>
> To cleanup, do we want to remove "pid=" from the other tracepoints in
> this file as well (in another patch). Or does this potentially break
> existing users?

We can't from task_newtask as that's the pid of the task that's being
created. In other words, it's very relevant. The task_rename could have its
pid field dropped.

>
> > > + __string( comm, task->comm )
> >
> > I'm also surprised that the comm didn't show in the trace_pipe.
>
> Any config options or tweaks needed to get it to show more reliably?
>
> > I've
> > updated the code so that it should usually find it. But saving it here may
> > not be a big deal.

How did you start it? Because it appears reliable for me.

-- Steve

Marco Elver

unread,
Nov 5, 2024, 12:23:16 PM11/5/24
to Steven Rostedt, Kees Cook, Masami Hiramatsu, Mathieu Desnoyers, Andrew Morton, Oleg Nesterov, linux-...@vger.kernel.org, linux-tra...@vger.kernel.org, Dmitry Vyukov, kasa...@googlegroups.com
Ack - will do.

> >
> > > > + __string( comm, task->comm )
> > >
> > > I'm also surprised that the comm didn't show in the trace_pipe.
> >
> > Any config options or tweaks needed to get it to show more reliably?
> >
> > > I've
> > > updated the code so that it should usually find it. But saving it here may
> > > not be a big deal.
>
> How did you start it? Because it appears reliable for me.

Very normally from bash. Maybe my env is broken in other ways, I'll
dig a little.

Marco Elver

unread,
Nov 6, 2024, 4:22:54 AM11/6/24
to Steven Rostedt, Kees Cook, Masami Hiramatsu, Mathieu Desnoyers, Andrew Morton, Oleg Nesterov, linux-...@vger.kernel.org, linux-tra...@vger.kernel.org, Dmitry Vyukov, kasa...@googlegroups.com
On Tue, 5 Nov 2024 at 18:22, Marco Elver <el...@google.com> wrote:
...
> > > > I'm also surprised that the comm didn't show in the trace_pipe.
> > >
> > > Any config options or tweaks needed to get it to show more reliably?
> > >
> > > > I've
> > > > updated the code so that it should usually find it. But saving it here may
> > > > not be a big deal.
> >
> > How did you start it? Because it appears reliable for me.
>
> Very normally from bash. Maybe my env is broken in other ways, I'll
> dig a little.

Some trial and error led me to conclude it's a race between the logic
looking up the comm and the process exiting: If the test program exits
soon after the traced event, it doesn't print the comm. Adding a
generous usleep() before it exits reliably prints the comm.

Steven Rostedt

unread,
Nov 6, 2024, 10:18:24 AM11/6/24
to Marco Elver, Kees Cook, Masami Hiramatsu, Mathieu Desnoyers, Andrew Morton, Oleg Nesterov, linux-...@vger.kernel.org, linux-tra...@vger.kernel.org, Dmitry Vyukov, kasa...@googlegroups.com
Thanks for letting me know. Let me see if I can fix that!

-- Steve

Steven Rostedt

unread,
Nov 6, 2024, 10:28:57 AM11/6/24
to Marco Elver, Kees Cook, Masami Hiramatsu, Mathieu Desnoyers, Andrew Morton, Oleg Nesterov, linux-...@vger.kernel.org, linux-tra...@vger.kernel.org, Dmitry Vyukov, kasa...@googlegroups.com
On Wed, 6 Nov 2024 10:18:23 -0500
Steven Rostedt <ros...@goodmis.org> wrote:

> > Some trial and error led me to conclude it's a race between the logic
> > looking up the comm and the process exiting: If the test program exits
> > soon after the traced event, it doesn't print the comm. Adding a
> > generous usleep() before it exits reliably prints the comm.
>
> Thanks for letting me know. Let me see if I can fix that!

Hmm, that still doesn't make sense. Is this just a single line or do you
have other events being recorded?

The way the caching works is during the sched_switch tracepoint which still
gets called when the task exits. If a trace event is triggered, it sets a
per cpu flags to have the next sched_switch record the comm for both the
previous and next tasks.

Now the reason it can miss is that there's contention on the lock that
saves the comms (it does a trylock and if it fails, it just skips it). Or
if another task hits the same "comm cache line".

This is why I wonder if you have other events being traced.

-- Steve

Marco Elver

unread,
Nov 6, 2024, 1:12:53 PM11/6/24
to Steven Rostedt, Kees Cook, Masami Hiramatsu, Mathieu Desnoyers, Andrew Morton, Oleg Nesterov, linux-...@vger.kernel.org, linux-tra...@vger.kernel.org, Dmitry Vyukov, kasa...@googlegroups.com
No other events should be traced. This is the test program I've used:

#include <sys/prctl.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
prctl(1234, 101, 102, 103, 104);
if (argc > 1)
usleep(1000);
return 0;
}

Kernel config is x86_64 default + CONFIG_FUNCTION_TRACER=y +
CONFIG_FTRACE_SYSCALLS=y. For the test, once booted all I do is:

% echo 1 > /sys/kernel/debug/tracing/events/task/task_prctl_unknown/enable
% cat /sys/kernel/debug/tracing/trace_pipe
... wait for output ...

That's pretty much it. I've attached my kernel config just in case I
missed something.
.config

Steven Rostedt

unread,
Nov 6, 2024, 4:23:43 PM11/6/24
to Marco Elver, Kees Cook, Masami Hiramatsu, Mathieu Desnoyers, Andrew Morton, Oleg Nesterov, linux-...@vger.kernel.org, linux-tra...@vger.kernel.org, Dmitry Vyukov, kasa...@googlegroups.com
On Wed, 6 Nov 2024 19:12:42 +0100
Marco Elver <el...@google.com> wrote:

> No other events should be traced. This is the test program I've used:
>
> #include <sys/prctl.h>
> #include <unistd.h>
>
> int main(int argc, char *argv[])
> {
> prctl(1234, 101, 102, 103, 104);
> if (argc > 1)
> usleep(1000);
> return 0;
> }
>
> Kernel config is x86_64 default + CONFIG_FUNCTION_TRACER=y +
> CONFIG_FTRACE_SYSCALLS=y. For the test, once booted all I do is:
>
> % echo 1 > /sys/kernel/debug/tracing/events/task/task_prctl_unknown/enable
> % cat /sys/kernel/debug/tracing/trace_pipe
> ... wait for output ...
>
> That's pretty much it. I've attached my kernel config just in case I
> missed something.

OK, it's because you are using trace_pipe (which by the way should not be
used for anything serious). The read of trace_pipe flushes the buffer
before the task is scheduled out and the comm saved, so it prints the
"<...>". If you instead do the cat of trace_pipe *after* running the
command, you'll see the comm.

So this is just because you are using the obsolete trace_pipe.

-- Steve

Marco Elver

unread,
Nov 6, 2024, 5:00:06 PM11/6/24
to Steven Rostedt, Kees Cook, Masami Hiramatsu, Mathieu Desnoyers, Andrew Morton, Oleg Nesterov, linux-...@vger.kernel.org, linux-tra...@vger.kernel.org, Dmitry Vyukov, kasa...@googlegroups.com
On Wed, 6 Nov 2024 at 22:23, Steven Rostedt <ros...@goodmis.org> wrote:
...
> > That's pretty much it. I've attached my kernel config just in case I
> > missed something.
>
> OK, it's because you are using trace_pipe (which by the way should not be
> used for anything serious). The read of trace_pipe flushes the buffer
> before the task is scheduled out and the comm saved, so it prints the
> "<...>". If you instead do the cat of trace_pipe *after* running the
> command, you'll see the comm.
>
> So this is just because you are using the obsolete trace_pipe.

I see, thanks for clarifying. I always felt for quick testing it
serves its purpose - anything equally simple you recommend for testing
but doesn't suffer from this problem?

Thanks,
-- Marco

Steven Rostedt

unread,
Nov 6, 2024, 5:30:24 PM11/6/24
to Marco Elver, Kees Cook, Masami Hiramatsu, Mathieu Desnoyers, Andrew Morton, Oleg Nesterov, linux-...@vger.kernel.org, linux-tra...@vger.kernel.org, Dmitry Vyukov, kasa...@googlegroups.com
You can run trace-cmd, or cat trace after the run.

-- Steve

Marco Elver

unread,
Nov 7, 2024, 7:26:55 AM11/7/24
to el...@google.com, Steven Rostedt, Kees Cook, Masami Hiramatsu, Mathieu Desnoyers, Andrew Morton, Oleg Nesterov, linux-...@vger.kernel.org, linux-tra...@vger.kernel.org, Dmitry Vyukov, kasa...@googlegroups.com
prctl() is a complex syscall which multiplexes its functionality based
on a large set of PR_* options. Currently we count 64 such options. The
return value of unknown options is -EINVAL, and doesn't distinguish from
known options that were passed invalid args that also return -EINVAL.

To understand if programs are attempting to use prctl() options not yet
available on the running kernel, provide the task_prctl_unknown
tracepoint.

Note, this tracepoint is in an unlikely cold path, and would therefore
be suitable for continuous monitoring (e.g. via perf_event_open).

While the above is likely the simplest usecase, additionally this
tracepoint can help unlock some testing scenarios (where probing
sys_enter or sys_exit causes undesirable performance overheads):

a. unprivileged triggering of a test module: test modules may register a
probe to be called back on task_prctl_unknown, and pick a very large
unknown prctl() option upon which they perform a test function for an
unprivileged user;

b. unprivileged triggering of an eBPF program function: similar
as idea (a).

Example trace_pipe output:

test-484 [000] ..... 631.748104: task_prctl_unknown: comm=test option=1234 arg2=101 arg3=102 arg4=103 arg5=104

Signed-off-by: Marco Elver <el...@google.com>
---
v2:
* Remove "pid" in trace output (suggested by Steven).
---
include/trace/events/task.h | 41 +++++++++++++++++++++++++++++++++++++
kernel/sys.c | 3 +++
2 files changed, 44 insertions(+)

diff --git a/include/trace/events/task.h b/include/trace/events/task.h
index 47b527464d1a..9202cb2524c4 100644
--- a/include/trace/events/task.h
+++ b/include/trace/events/task.h
@@ -56,6 +56,47 @@ TRACE_EVENT(task_rename,
__entry->newcomm, __entry->oom_score_adj)
);

+/**
+ * task_prctl_unknown - called on unknown prctl() option
+ * @task: pointer to the current task
+ * @option: option passed
+ * @arg2: arg2 passed
+ * @arg3: arg3 passed
+ * @arg4: arg4 passed
+ * @arg5: arg5 passed
+ *
+ * Called on an unknown prctl() option.
+ */
+TRACE_EVENT(task_prctl_unknown,
+
+ TP_PROTO(struct task_struct *task, int option, unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5),
+
+ TP_ARGS(task, option, arg2, arg3, arg4, arg5),
+
+ TP_STRUCT__entry(
+ __string( comm, task->comm )
+ __field( int, option)
+ __field( unsigned long, arg2)
+ __field( unsigned long, arg3)
+ __field( unsigned long, arg4)
+ __field( unsigned long, arg5)
+ ),
+
+ TP_fast_assign(
+ __assign_str(comm);
+ __entry->option = option;
+ __entry->arg2 = arg2;
+ __entry->arg3 = arg3;
+ __entry->arg4 = arg4;
+ __entry->arg5 = arg5;
+ ),
+
+ TP_printk("comm=%s option=%d arg2=%ld arg3=%ld arg4=%ld arg5=%ld",
+ __get_str(comm), __entry->option,
+ __entry->arg2, __entry->arg3, __entry->arg4, __entry->arg5)
+);
+
#endif

/* This part must be outside protection */
diff --git a/kernel/sys.c b/kernel/sys.c
index 4da31f28fda8..dd0a71b68558 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -75,6 +75,8 @@
#include <asm/io.h>
#include <asm/unistd.h>

+#include <trace/events/task.h>
+
#include "uid16.h"

#ifndef SET_UNALIGN_CTL
@@ -2785,6 +2787,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
error = RISCV_SET_ICACHE_FLUSH_CTX(arg2, arg3);
break;
default:
+ trace_task_prctl_unknown(me, option, arg2, arg3, arg4, arg5);
error = -EINVAL;
break;
}
--
2.47.0.199.ga7371fff76-goog

Marco Elver

unread,
Nov 7, 2024, 7:26:59 AM11/7/24
to el...@google.com, Steven Rostedt, Kees Cook, Masami Hiramatsu, Mathieu Desnoyers, Andrew Morton, Oleg Nesterov, linux-...@vger.kernel.org, linux-tra...@vger.kernel.org, Dmitry Vyukov, kasa...@googlegroups.com
Remove pid in task_rename tracepoint output, since that tracepoint only
deals with the current task, and is printed by default. This also saves
some space in the entry and avoids wasted padding.

Link: https://lkml.kernel.org/r/20241105120...@gandalf.local.home
Suggested-by: Steven Rostedt <ros...@goodmis.org>
Signed-off-by: Marco Elver <el...@google.com>
---
v2:
* New patch
---
include/trace/events/task.h | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/include/trace/events/task.h b/include/trace/events/task.h
index 9202cb2524c4..ee202aafa9fd 100644
--- a/include/trace/events/task.h
+++ b/include/trace/events/task.h
@@ -38,22 +38,19 @@ TRACE_EVENT(task_rename,
TP_ARGS(task, comm),

TP_STRUCT__entry(
- __field( pid_t, pid)
__array( char, oldcomm, TASK_COMM_LEN)
__array( char, newcomm, TASK_COMM_LEN)
__field( short, oom_score_adj)
),

TP_fast_assign(
- __entry->pid = task->pid;
memcpy(entry->oldcomm, task->comm, TASK_COMM_LEN);
strscpy(entry->newcomm, comm, TASK_COMM_LEN);
__entry->oom_score_adj = task->signal->oom_score_adj;
),

- TP_printk("pid=%d oldcomm=%s newcomm=%s oom_score_adj=%hd",
- __entry->pid, __entry->oldcomm,
- __entry->newcomm, __entry->oom_score_adj)
+ TP_printk("oldcomm=%s newcomm=%s oom_score_adj=%hd",
+ __entry->oldcomm, __entry->newcomm, __entry->oom_score_adj)
);

/**
--
2.47.0.199.ga7371fff76-goog

Steven Rostedt

unread,
Nov 7, 2024, 10:34:09 AM11/7/24
to Marco Elver, Kees Cook, Masami Hiramatsu, Mathieu Desnoyers, Andrew Morton, Oleg Nesterov, linux-...@vger.kernel.org, linux-tra...@vger.kernel.org, Dmitry Vyukov, kasa...@googlegroups.com
On Thu, 7 Nov 2024 13:25:47 +0100
Marco Elver <el...@google.com> wrote:

> +/**
> + * task_prctl_unknown - called on unknown prctl() option
> + * @task: pointer to the current task
> + * @option: option passed
> + * @arg2: arg2 passed
> + * @arg3: arg3 passed
> + * @arg4: arg4 passed
> + * @arg5: arg5 passed
> + *
> + * Called on an unknown prctl() option.
> + */
> +TRACE_EVENT(task_prctl_unknown,
> +
> + TP_PROTO(struct task_struct *task, int option, unsigned long arg2, unsigned long arg3,
> + unsigned long arg4, unsigned long arg5),
> +
> + TP_ARGS(task, option, arg2, arg3, arg4, arg5),
> +
> + TP_STRUCT__entry(
> + __string( comm, task->comm )

The question is, do we really need comm? From your example, it's redundant:

test-484 [000] ..... 631.748104: task_prctl_unknown: comm=test option=1234 arg2=101 arg3=102 arg4=103 arg5=104
^^^^ ^^^^

-- Steve

Mathieu Desnoyers

unread,
Nov 7, 2024, 10:45:44 AM11/7/24
to Marco Elver, Steven Rostedt, Kees Cook, Masami Hiramatsu, Andrew Morton, Oleg Nesterov, linux-...@vger.kernel.org, linux-tra...@vger.kernel.org, Dmitry Vyukov, kasa...@googlegroups.com
On 2024-11-07 07:25, Marco Elver wrote:
> prctl() is a complex syscall which multiplexes its functionality based
> on a large set of PR_* options. Currently we count 64 such options. The
> return value of unknown options is -EINVAL, and doesn't distinguish from
> known options that were passed invalid args that also return -EINVAL.
>
> To understand if programs are attempting to use prctl() options not yet
> available on the running kernel, provide the task_prctl_unknown
> tracepoint.
>
> Note, this tracepoint is in an unlikely cold path, and would therefore
> be suitable for continuous monitoring (e.g. via perf_event_open).
>
> While the above is likely the simplest usecase, additionally this
> tracepoint can help unlock some testing scenarios (where probing
> sys_enter or sys_exit causes undesirable performance overheads):
>
> a. unprivileged triggering of a test module: test modules may register a
> probe to be called back on task_prctl_unknown, and pick a very large
> unknown prctl() option upon which they perform a test function for an
> unprivileged user;
>
> b. unprivileged triggering of an eBPF program function: similar
> as idea (a).
>
> Example trace_pipe output:
>
> test-484 [000] ..... 631.748104: task_prctl_unknown: comm=test option=1234 arg2=101 arg3=102 arg4=103 arg5=104
>

My concern is that we start adding tons of special-case
tracepoints to the implementation of system calls which
are redundant with the sys_enter/exit tracepoints.

Why favor this approach rather than hooking on sys_enter/exit ?

Thanks,

Mathieu
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Marco Elver

unread,
Nov 7, 2024, 10:47:26 AM11/7/24
to Mathieu Desnoyers, Steven Rostedt, Kees Cook, Masami Hiramatsu, Andrew Morton, Oleg Nesterov, linux-...@vger.kernel.org, linux-tra...@vger.kernel.org, Dmitry Vyukov, kasa...@googlegroups.com
It's __extremely__ expensive when deployed at scale. See note in
commit description above.

Marco Elver

unread,
Nov 7, 2024, 10:47:36 AM11/7/24
to Steven Rostedt, Kees Cook, Masami Hiramatsu, Mathieu Desnoyers, Andrew Morton, Oleg Nesterov, linux-...@vger.kernel.org, linux-tra...@vger.kernel.org, Dmitry Vyukov, kasa...@googlegroups.com
On Thu, 7 Nov 2024 at 16:34, Steven Rostedt <ros...@goodmis.org> wrote:
...
> > + TP_PROTO(struct task_struct *task, int option, unsigned long arg2, unsigned long arg3,
> > + unsigned long arg4, unsigned long arg5),
> > +
> > + TP_ARGS(task, option, arg2, arg3, arg4, arg5),
> > +
> > + TP_STRUCT__entry(
> > + __string( comm, task->comm )
>
> The question is, do we really need comm? From your example, it's redundant:
>
> test-484 [000] ..... 631.748104: task_prctl_unknown: comm=test option=1234 arg2=101 arg3=102 arg4=103 arg5=104
> ^^^^ ^^^^

Ack, let's remove it. I will also remove the "task" argument.

Steven Rostedt

unread,
Nov 7, 2024, 10:52:09 AM11/7/24
to Marco Elver, Mathieu Desnoyers, Kees Cook, Masami Hiramatsu, Andrew Morton, Oleg Nesterov, linux-...@vger.kernel.org, linux-tra...@vger.kernel.org, Dmitry Vyukov, kasa...@googlegroups.com
On Thu, 7 Nov 2024 16:46:47 +0100
Marco Elver <el...@google.com> wrote:

> > My concern is that we start adding tons of special-case
> > tracepoints to the implementation of system calls which
> > are redundant with the sys_enter/exit tracepoints.
> >
> > Why favor this approach rather than hooking on sys_enter/exit ?
>
> It's __extremely__ expensive when deployed at scale. See note in
> commit description above.

Agreed. The sys_enter/exit trace events make all syscalls go the slow path,
which can be quite expensive.

-- Steve

Mathieu Desnoyers

unread,
Nov 7, 2024, 10:54:11 AM11/7/24
to Marco Elver, Steven Rostedt, Kees Cook, Masami Hiramatsu, Andrew Morton, Oleg Nesterov, linux-...@vger.kernel.org, linux-tra...@vger.kernel.org, Dmitry Vyukov, kasa...@googlegroups.com
I suspect you base the overhead analysis on the x86-64 implementation
of sys_enter/exit tracepoint and especially the overhead caused by
the SYSCALL_WORK_SYSCALL_TRACEPOINT thread flag, am I correct ?

If that is causing a too large overhead, we should investigate if
those can be improved instead of adding tracepoints in the
implementation of system calls.

Thanks,

Mathieu

Marco Elver

unread,
Nov 7, 2024, 10:58:20 AM11/7/24
to Mathieu Desnoyers, Steven Rostedt, Kees Cook, Masami Hiramatsu, Andrew Morton, Oleg Nesterov, linux-...@vger.kernel.org, linux-tra...@vger.kernel.org, Dmitry Vyukov, kasa...@googlegroups.com
On Thu, 7 Nov 2024 at 16:54, Mathieu Desnoyers
Doing that may be generally useful, but even if you improve it
somehow, there's always some additional bit of work needed on
sys_enter/exit as soon as a tracepoint is attached. Even if that's
just a few cycles, it's too much (for me at least).

Also: if you just hook sys_enter/exit, you don't know if the prctl was
handled or not by inspecting the return code (-EINVAL). I want the
kernel to tell me if it handled the prctl() or not, and I also think
it's very bad design to copy-paste the prctl() option checking of the
running kernel in a sys_enter/exit hook. This doesn't scale in terms
of performance nor maintainability.

Steven Rostedt

unread,
Nov 7, 2024, 11:04:15 AM11/7/24
to Mathieu Desnoyers, Marco Elver, Kees Cook, Masami Hiramatsu, Andrew Morton, Oleg Nesterov, linux-...@vger.kernel.org, linux-tra...@vger.kernel.org, Dmitry Vyukov, kasa...@googlegroups.com
On Thu, 7 Nov 2024 10:52:37 -0500
Mathieu Desnoyers <mathieu....@efficios.com> wrote:

> I suspect you base the overhead analysis on the x86-64 implementation
> of sys_enter/exit tracepoint and especially the overhead caused by
> the SYSCALL_WORK_SYSCALL_TRACEPOINT thread flag, am I correct ?
>
> If that is causing a too large overhead, we should investigate if
> those can be improved instead of adding tracepoints in the
> implementation of system calls.

That would be great to get better, but the reason I'm not against this
patch is because prctl() is not a normal system call. It's basically an
ioctl() for Linux, and very vague. It's basically the garbage system call
when you don't know what to do. It's even being proposed for the sframe
work.

I understand your sentiment and agree. I don't want any random system call
to get a tracepoint attached to it. But here I'd make an exception.

-- Steve

Mathieu Desnoyers

unread,
Nov 7, 2024, 11:37:33 AM11/7/24
to Steven Rostedt, Marco Elver, Kees Cook, Masami Hiramatsu, Andrew Morton, Oleg Nesterov, linux-...@vger.kernel.org, linux-tra...@vger.kernel.org, Dmitry Vyukov, kasa...@googlegroups.com
Should we document this as an "instrumentation good practice" then ?

When the system call is a multiplexor such as ioctl(2) and prctl(2),
then instrumenting it with tracepoints within each of the "op" case
makes sense for overall maintainability.

For non-multiplexor system calls, using the existing sys_enter/exit
tracepoints should be favored.

This opens the following question for non-multiplexors system calls:
considering that the overhead of the current sys_enter/exit
instrumentation is deemed to large to use in production, perhaps
we should consider a few alternatives, namely:

A) Modify SYSCALL_DEFINE so it emits a function wrapper with tracepoints
for each system call enter/exit, except for multiplexors, or

B) Add the plumbing required to allow system call tracing to be
activated for specific system calls only, more fine-grained than
the current system-wide for_each_process_thread()
SYSCALL_WORK_SYSCALL_TRACEPOINT thread flag big hammer.

Another scenario to consider is system calls that have iovec arguments.
Should we add tracepoint within the iovec iteration, or should it target
the entire iovec as input/output at system call enter/exit ?

Marco Elver

unread,
Nov 8, 2024, 6:35:06 AM11/8/24
to el...@google.com, Steven Rostedt, Kees Cook, Masami Hiramatsu, Mathieu Desnoyers, Andrew Morton, Oleg Nesterov, linux-...@vger.kernel.org, linux-tra...@vger.kernel.org, Dmitry Vyukov, kasa...@googlegroups.com
prctl() is a complex syscall which multiplexes its functionality based
on a large set of PR_* options. Currently we count 64 such options. The
return value of unknown options is -EINVAL, and doesn't distinguish from
known options that were passed invalid args that also return -EINVAL.

To understand if programs are attempting to use prctl() options not yet
available on the running kernel, provide the task_prctl_unknown
tracepoint.

Note, this tracepoint is in an unlikely cold path, and would therefore
be suitable for continuous monitoring (e.g. via perf_event_open).

While the above is likely the simplest usecase, additionally this
tracepoint can help unlock some testing scenarios (where probing
sys_enter or sys_exit causes undesirable performance overheads):

a. unprivileged triggering of a test module: test modules may register a
probe to be called back on task_prctl_unknown, and pick a very large
unknown prctl() option upon which they perform a test function for an
unprivileged user;

b. unprivileged triggering of an eBPF program function: similar
as idea (a).

Example trace_pipe output:

test-380 [001] ..... 78.142904: task_prctl_unknown: option=1234 arg2=101 arg3=102 arg4=103 arg5=104

Signed-off-by: Marco Elver <el...@google.com>
---
v3:
* Remove "comm".

v2:
* Remove "pid" in trace output (suggested by Steven).
---
include/trace/events/task.h | 37 +++++++++++++++++++++++++++++++++++++
kernel/sys.c | 3 +++
2 files changed, 40 insertions(+)

diff --git a/include/trace/events/task.h b/include/trace/events/task.h
index 47b527464d1a..209d315852fb 100644
--- a/include/trace/events/task.h
+++ b/include/trace/events/task.h
@@ -56,6 +56,43 @@ TRACE_EVENT(task_rename,
__entry->newcomm, __entry->oom_score_adj)
);

+/**
+ * task_prctl_unknown - called on unknown prctl() option
+ * @option: option passed
+ * @arg2: arg2 passed
+ * @arg3: arg3 passed
+ * @arg4: arg4 passed
+ * @arg5: arg5 passed
+ *
+ * Called on an unknown prctl() option.
+ */
+TRACE_EVENT(task_prctl_unknown,
+
+ TP_PROTO(int option, unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5),
+
+ TP_ARGS(option, arg2, arg3, arg4, arg5),
+
+ TP_STRUCT__entry(
+ __field( int, option)
+ __field( unsigned long, arg2)
+ __field( unsigned long, arg3)
+ __field( unsigned long, arg4)
+ __field( unsigned long, arg5)
+ ),
+
+ TP_fast_assign(
+ __entry->option = option;
+ __entry->arg2 = arg2;
+ __entry->arg3 = arg3;
+ __entry->arg4 = arg4;
+ __entry->arg5 = arg5;
+ ),
+
+ TP_printk("option=%d arg2=%ld arg3=%ld arg4=%ld arg5=%ld",
+ __entry->option, __entry->arg2, __entry->arg3, __entry->arg4, __entry->arg5)
+);
+
#endif

/* This part must be outside protection */
diff --git a/kernel/sys.c b/kernel/sys.c
index 4da31f28fda8..b366cef102ec 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -75,6 +75,8 @@
#include <asm/io.h>
#include <asm/unistd.h>

+#include <trace/events/task.h>
+
#include "uid16.h"

#ifndef SET_UNALIGN_CTL
@@ -2785,6 +2787,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
error = RISCV_SET_ICACHE_FLUSH_CTX(arg2, arg3);
break;
default:
+ trace_task_prctl_unknown(option, arg2, arg3, arg4, arg5);
error = -EINVAL;
break;
}
--
2.47.0.277.g8800431eea-goog

Marco Elver

unread,
Nov 8, 2024, 6:35:10 AM11/8/24
to el...@google.com, Steven Rostedt, Kees Cook, Masami Hiramatsu, Mathieu Desnoyers, Andrew Morton, Oleg Nesterov, linux-...@vger.kernel.org, linux-tra...@vger.kernel.org, Dmitry Vyukov, kasa...@googlegroups.com
Remove pid in task_rename tracepoint output, since that tracepoint only
deals with the current task, and is printed by default. This also saves
some space in the entry and avoids wasted padding.

Link: https://lkml.kernel.org/r/20241105120...@gandalf.local.home
Suggested-by: Steven Rostedt <ros...@goodmis.org>
Signed-off-by: Marco Elver <el...@google.com>
---
v2:
* New patch
---
include/trace/events/task.h | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/include/trace/events/task.h b/include/trace/events/task.h
index 209d315852fb..af535b053033 100644
--- a/include/trace/events/task.h
+++ b/include/trace/events/task.h
@@ -38,22 +38,19 @@ TRACE_EVENT(task_rename,
TP_ARGS(task, comm),

TP_STRUCT__entry(
- __field( pid_t, pid)
__array( char, oldcomm, TASK_COMM_LEN)
__array( char, newcomm, TASK_COMM_LEN)
__field( short, oom_score_adj)
),

TP_fast_assign(
- __entry->pid = task->pid;
memcpy(entry->oldcomm, task->comm, TASK_COMM_LEN);
strscpy(entry->newcomm, comm, TASK_COMM_LEN);
__entry->oom_score_adj = task->signal->oom_score_adj;
),

- TP_printk("pid=%d oldcomm=%s newcomm=%s oom_score_adj=%hd",
- __entry->pid, __entry->oldcomm,
- __entry->newcomm, __entry->oom_score_adj)
+ TP_printk("oldcomm=%s newcomm=%s oom_score_adj=%hd",
+ __entry->oldcomm, __entry->newcomm, __entry->oom_score_adj)
);

/**
--
2.47.0.277.g8800431eea-goog

Marco Elver

unread,
Nov 15, 2024, 7:00:40 AM11/15/24
to el...@google.com, Steven Rostedt, Kees Cook, Masami Hiramatsu, Mathieu Desnoyers, Andrew Morton, Oleg Nesterov, linux-...@vger.kernel.org, linux-tra...@vger.kernel.org, Dmitry Vyukov, kasa...@googlegroups.com
On Fri, 8 Nov 2024 at 12:35, Marco Elver <el...@google.com> wrote:
>
> prctl() is a complex syscall which multiplexes its functionality based
> on a large set of PR_* options. Currently we count 64 such options. The
> return value of unknown options is -EINVAL, and doesn't distinguish from
> known options that were passed invalid args that also return -EINVAL.
>
> To understand if programs are attempting to use prctl() options not yet
> available on the running kernel, provide the task_prctl_unknown
> tracepoint.
>
> Note, this tracepoint is in an unlikely cold path, and would therefore
> be suitable for continuous monitoring (e.g. via perf_event_open).
>
> While the above is likely the simplest usecase, additionally this
> tracepoint can help unlock some testing scenarios (where probing
> sys_enter or sys_exit causes undesirable performance overheads):
>
> a. unprivileged triggering of a test module: test modules may register a
> probe to be called back on task_prctl_unknown, and pick a very large
> unknown prctl() option upon which they perform a test function for an
> unprivileged user;
>
> b. unprivileged triggering of an eBPF program function: similar
> as idea (a).
>
> Example trace_pipe output:
>
> test-380 [001] ..... 78.142904: task_prctl_unknown: option=1234 arg2=101 arg3=102 arg4=103 arg5=104
>
> Signed-off-by: Marco Elver <el...@google.com>

Steven, unless there are any further objections, would you be able to
take this through the tracing tree?

Many thanks!

Steven Rostedt

unread,
Nov 15, 2024, 8:27:17 AM11/15/24
to Marco Elver, Kees Cook, Masami Hiramatsu, Mathieu Desnoyers, Andrew Morton, Oleg Nesterov, linux-...@vger.kernel.org, linux-tra...@vger.kernel.org, Dmitry Vyukov, kasa...@googlegroups.com
On Fri, 15 Nov 2024 13:00:00 +0100
Marco Elver <el...@google.com> wrote:

> Steven, unless there are any further objections, would you be able to
> take this through the tracing tree?
>
> Many thanks!

This isn't my file. Trace events usually belong to the subsystems that
use them. As this adds an event to kernel/sys.c which doesn't really have
an owner, then I would ask Andrew Morton to take it.

-- Steve

Marco Elver

unread,
Nov 15, 2024, 10:06:50 AM11/15/24
to Steven Rostedt, Andrew Morton, Kees Cook, Masami Hiramatsu, Mathieu Desnoyers, Oleg Nesterov, linux-...@vger.kernel.org, linux-tra...@vger.kernel.org, Dmitry Vyukov, kasa...@googlegroups.com
Got it.

Andrew, can you pick this up?

Thanks,
-- Marco

Marco Elver

unread,
Dec 11, 2024, 9:09:47 AM12/11/24
to Andrew Morton, Kees Cook, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Oleg Nesterov, linux-...@vger.kernel.org, linux-tra...@vger.kernel.org, Dmitry Vyukov, kasa...@googlegroups.com
Gentle ping - many thanks,
-- Marco

Alexander Potapenko

unread,
Dec 12, 2024, 5:06:23 AM12/12/24
to Marco Elver, Steven Rostedt, Kees Cook, Masami Hiramatsu, Mathieu Desnoyers, Andrew Morton, Oleg Nesterov, linux-...@vger.kernel.org, linux-tra...@vger.kernel.org, Dmitry Vyukov, kasa...@googlegroups.com
On Fri, Nov 8, 2024 at 12:35 PM 'Marco Elver' via kasan-dev
<kasa...@googlegroups.com> wrote:
>
> prctl() is a complex syscall which multiplexes its functionality based
> on a large set of PR_* options. Currently we count 64 such options. The
> return value of unknown options is -EINVAL, and doesn't distinguish from
> known options that were passed invalid args that also return -EINVAL.
>
> To understand if programs are attempting to use prctl() options not yet
> available on the running kernel, provide the task_prctl_unknown
> tracepoint.
>
> Note, this tracepoint is in an unlikely cold path, and would therefore
> be suitable for continuous monitoring (e.g. via perf_event_open).
>
> While the above is likely the simplest usecase, additionally this
> tracepoint can help unlock some testing scenarios (where probing
> sys_enter or sys_exit causes undesirable performance overheads):
>
> a. unprivileged triggering of a test module: test modules may register a
> probe to be called back on task_prctl_unknown, and pick a very large
> unknown prctl() option upon which they perform a test function for an
> unprivileged user;
>
> b. unprivileged triggering of an eBPF program function: similar
> as idea (a).
>
> Example trace_pipe output:
>
> test-380 [001] ..... 78.142904: task_prctl_unknown: option=1234 arg2=101 arg3=102 arg4=103 arg5=104

For what it's worth:

> Signed-off-by: Marco Elver <el...@google.com>
Reviewed-by: Alexander Potapenko <gli...@google.com>

Kees Cook

unread,
Dec 16, 2024, 7:38:37 PM12/16/24
to Steven Rostedt, Kees Cook, Marco Elver, Masami Hiramatsu, Mathieu Desnoyers, Andrew Morton, Oleg Nesterov, linux-...@vger.kernel.org, linux-tra...@vger.kernel.org, Dmitry Vyukov, kasa...@googlegroups.com
On Fri, 08 Nov 2024 12:34:24 +0100, Marco Elver wrote:
> prctl() is a complex syscall which multiplexes its functionality based
> on a large set of PR_* options. Currently we count 64 such options. The
> return value of unknown options is -EINVAL, and doesn't distinguish from
> known options that were passed invalid args that also return -EINVAL.
>
> To understand if programs are attempting to use prctl() options not yet
> available on the running kernel, provide the task_prctl_unknown
> tracepoint.
>
> [...]

Applied to for-next/hardening, thanks!

[1/2] tracing: Add task_prctl_unknown tracepoint
https://git.kernel.org/kees/c/57a6baf3a3ea
[2/2] tracing: Remove pid in task_rename tracing output
https://git.kernel.org/kees/c/a6115cceb1dd

Take care,

--
Kees Cook

Reply all
Reply to author
Forward
0 new messages