
[PATCH] perf/core: Add a tracepoint for perf sampling


Brendan Gregg

Jul 19, 2016, 7:30:07 PM
When perf is performing hrtimer-based sampling, this tracepoint can be used
by BPF to run additional logic on each sample. For example, BPF can fetch
stack traces and frequency count them in kernel context, for an efficient
profiler.

Signed-off-by: Brendan Gregg <bgr...@netflix.com>
Cc: Alexei Starovoitov <a...@kernel.org>
Cc: Wang Nan <wang...@huawei.com>
---
 include/trace/events/perf.h | 29 +++++++++++++++++++++++++++++
 kernel/events/core.c        |  5 +++++
 2 files changed, 34 insertions(+)
create mode 100644 include/trace/events/perf.h

diff --git a/include/trace/events/perf.h b/include/trace/events/perf.h
new file mode 100644
index 0000000..461770d
--- /dev/null
+++ b/include/trace/events/perf.h
@@ -0,0 +1,29 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM perf
+
+#if !defined(_TRACE_PERF_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_PERF_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(perf_hrtimer,
+        TP_PROTO(struct pt_regs *regs, struct perf_event *event),
+
+        TP_ARGS(regs, event),
+
+        TP_STRUCT__entry(
+                __field(struct pt_regs *, regs)
+                __field(struct perf_event *, event)
+        ),
+
+        TP_fast_assign(
+                __entry->regs = regs;
+                __entry->event = event;
+        ),
+
+        TP_printk("regs=%p evt=%p", __entry->regs, __entry->event)
+);
+#endif /* _TRACE_PERF_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 79dae18..0d843a7 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -51,6 +51,9 @@

#include <asm/irq_regs.h>

+#define CREATE_TRACE_POINTS
+#include <trace/events/perf.h>
+
typedef int (*remote_function_f)(void *);

struct remote_function_call {
@@ -8036,6 +8039,8 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
         perf_sample_data_init(&data, 0, event->hw.last_period);
         regs = get_irq_regs();

+        trace_perf_hrtimer(regs, event);
+
         if (regs && !perf_exclude_event(event, regs)) {
                 if (!(event->attr.exclude_idle && is_idle_task(current)))
                         if (__perf_event_overflow(event, 1, &data, regs))
--
2.7.4
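
If applied, the tracepoint shows up under tracefs like any other event,
e.g. (illustrative listing):

# ls /sys/kernel/debug/tracing/events/perf/perf_hrtimer/
enable  filter  format  id  trigger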

Brendan Gregg

Jul 29, 2016, 2:10:06 PM
On Tue, Jul 19, 2016 at 4:20 PM, Brendan Gregg <bgr...@netflix.com> wrote:
> When perf is performing hrtimer-based sampling, this tracepoint can be used
> by BPF to run additional logic on each sample. For example, BPF can fetch
> stack traces and frequency count them in kernel context, for an efficient
> profiler.

Any comments on this patch? Thanks,

Brendan

Arnaldo Carvalho de Melo

Jul 29, 2016, 3:30:05 PM
On Tue, Jul 19, 2016 at 11:20:48PM +0000, Brendan Gregg wrote:
> When perf is performing hrtimer-based sampling, this tracepoint can be used
> by BPF to run additional logic on each sample. For example, BPF can fetch
> stack traces and frequency count them in kernel context, for an efficient
> profiler.

Could you provide a complete experience? I.e. together with this patch a
bpf script that could then run, with the full set of steps needed to
show it in use.

Also, what would be the value when BPF is not used?

- Arnaldo

Brendan Gregg

Jul 29, 2016, 4:00:06 PM
On Fri, Jul 29, 2016 at 12:21 PM, Arnaldo Carvalho de Melo
<ac...@kernel.org> wrote:
> On Tue, Jul 19, 2016 at 11:20:48PM +0000, Brendan Gregg wrote:
>> When perf is performing hrtimer-based sampling, this tracepoint can be used
>> by BPF to run additional logic on each sample. For example, BPF can fetch
>> stack traces and frequency count them in kernel context, for an efficient
>> profiler.
>
> Could you provide a complete experience? I.e. together with this patch a
> bpf script that could then run, with the full set of steps needed to
> show it in use.

There's currently profile.py in bcc, which will use this tracepoint if
it exists, and otherwise fall back to a kprobe (although the kprobe
approach is unreliable). profile samples stack traces and prints them
with their occurrence counts. Eg:

# ./profile
Sampling at 49 Hertz of all threads by user + kernel stack... Hit Ctrl-C to end.
^C
ffffffff81189249 filemap_map_pages
ffffffff811bd3f5 handle_mm_fault
ffffffff81065990 __do_page_fault
ffffffff81065caf do_page_fault
ffffffff817ce228 page_fault
00007fed989afcc0 [unknown]
- cp (9036)
1
[...]

ffffffff8105eb66 native_safe_halt
ffffffff8103659e default_idle
ffffffff81036d1f arch_cpu_idle
ffffffff810bba5a default_idle_call
ffffffff810bbd07 cpu_startup_entry
ffffffff817bf4a7 rest_init
ffffffff81d65f58 start_kernel
ffffffff81d652db x86_64_start_reservations
ffffffff81d65418 x86_64_start_kernel
- swapper/0 (0)
72

ffffffff8105eb66 native_safe_halt
ffffffff8103659e default_idle
ffffffff81036d1f arch_cpu_idle
ffffffff810bba5a default_idle_call
ffffffff810bbd07 cpu_startup_entry
ffffffff8104df55 start_secondary
- swapper/1 (0)
75

Tool and examples are on github [1][2]. Is this sufficient for this
patch? If not, I could rewrite something for samples/bpf (eg, an IP
sampler, or a task priority sampler), which I may do anyway as a
follow-on if they turn out to be nice examples.
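
For a sense of the mechanism, here is a minimal bcc-style sketch of such
a profiler -- frequency counting kernel stacks on each sample. This is
illustrative only, not profile.py's actual code, and assumes the
tracepoint is exposed as perf:perf_hrtimer:

#include <uapi/linux/ptrace.h>

BPF_HASH(counts, int, u64);            // stack id -> number of samples
BPF_STACK_TRACE(stack_traces, 10240);  // deduplicated stack storage

TRACEPOINT_PROBE(perf, perf_hrtimer)
{
    // walk the kernel stack, then bump the count for its stack id
    int stack_id = stack_traces.get_stackid(args, 0);
    if (stack_id >= 0)
        counts.increment(stack_id);
    return 0;
}

User space then reads the counts and stack_traces maps and prints each
stack with its count, producing output like the above.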

>
> Also, what would be the value when BPF is not used?
>

No big reason comes to mind. I could imagine it might be useful when
debugging perf's sampling behavior, and there might be uses with
ftrace as well. But the big reason is extending perf's existing
sampling capabilities for in-kernel frequency counts of stack traces
(which could include custom BPF-based stack walkers), IP, task
priority, etc. Thanks,

Brendan

[1] https://github.com/iovisor/bcc/blob/master/tools/profile.py
[2] https://github.com/iovisor/bcc/blob/master/tools/profile_example.txt

Wangnan (F)

Jul 29, 2016, 11:50:05 PM


On 2016/7/30 2:05, Brendan Gregg wrote:
> On Tue, Jul 19, 2016 at 4:20 PM, Brendan Gregg <bgr...@netflix.com> wrote:
>> When perf is performing hrtimer-based sampling, this tracepoint can be used
>> by BPF to run additional logic on each sample. For example, BPF can fetch
>> stack traces and frequency count them in kernel context, for an efficient
>> profiler.
> Any comments on this patch? Thanks,
>
> Brendan

Sorry for the late reply.

I think it is a useful feature. Could you please provide an example
to show how to use it in perf?

If I understand correctly, I can have a BPF script run 99 times per
second using

# perf -e cpu-clock/freq=99/ -e mybpf.c ...

And in mybpf.c, attach a BPF script on the new tracepoint. Right?

Also, since we already have timer:hrtimer_expire_entry, please provide
some further information about why we need a new tracepoint.

Thank you.

Brendan Gregg

Aug 2, 2016, 11:00:05 PM
When perf is performing hrtimer-based sampling, this tracepoint can be used
by BPF to run additional logic on each sample. For example, BPF can fetch
stack traces and frequency count them in kernel context, for an efficient
profiler.

index 356a6c7..78bce19 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -51,6 +51,9 @@

#include <asm/irq_regs.h>

+#define CREATE_TRACE_POINTS
+#include <trace/events/perf.h>
+
typedef int (*remote_function_f)(void *);

struct remote_function_call {
@@ -8062,6 +8065,8 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)

Brendan Gregg

Aug 2, 2016, 11:00:08 PM
On Fri, Jul 29, 2016 at 8:34 PM, Wangnan (F) <wang...@huawei.com> wrote:
>
>
> On 2016/7/30 2:05, Brendan Gregg wrote:
>>
>> On Tue, Jul 19, 2016 at 4:20 PM, Brendan Gregg <bgr...@netflix.com> wrote:
>>>
>>> When perf is performing hrtimer-based sampling, this tracepoint can be
>>> used
>>> by BPF to run additional logic on each sample. For example, BPF can fetch
>>> stack traces and frequency count them in kernel context, for an efficient
>>> profiler.
>>
>> Any comments on this patch? Thanks,
>>
>> Brendan
>
>
> Sorry for the late.
>
> I think it is a useful feature. Could you please provide an example
> to show how to use it in perf?

Yes, the following example samples at 999 Hertz, and emits the
instruction pointer only when it is within a custom address range, as
checked by BPF. Eg:

# ./perf record -e bpf-output/no-inherit,name=evt/ \
    -e ./sampleip_range.c/map:channel.event=evt/ \
    -a ./perf record -F 999 -e cpu-clock -N -a -o /dev/null sleep 5
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.000 MB /dev/null ]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.134 MB perf.data (222 samples) ]

# ./perf script -F comm,pid,time,bpf-output
'bpf-output' not valid for hardware events. Ignoring.
'bpf-output' not valid for unknown events. Ignoring.
'bpf-output' not valid for unknown events. Ignoring.
dd  6501 3058.117379:
    BPF output: 0000: 3c 4c 21 81 ff ff ff ff  <L!.....
                0008: 00 00 00 00              ....

dd  6501 3058.130392:
    BPF output: 0000: 55 4c 21 81 ff ff ff ff  UL!.....
                0008: 00 00 00 00              ....

dd  6501 3058.131393:
    BPF output: 0000: 55 4c 21 81 ff ff ff ff  UL!.....
                0008: 00 00 00 00              ....

dd  6501 3058.149411:
    BPF output: 0000: e1 4b 21 81 ff ff ff ff  .K!.....
                0008: 00 00 00 00              ....

dd  6501 3058.155417:
    BPF output: 0000: 76 4c 21 81 ff ff ff ff  vL!.....
                0008: 00 00 00 00              ....

For that example, perf is running a BPF program to emit filtered
details, and running a second perf to configure sampling. We can
certainly improve how this works. And this will be much more
interesting once perf can emit maps, and a perf BPF program can
populate a map.

Here's sampleip_range.c:

/************************ BEGIN **************************/
#include <uapi/linux/bpf.h>
#include <uapi/linux/ptrace.h>

#define SEC(NAME) __attribute__((section(NAME), used))

/*
 * Edit the following to match the instruction address range you want to
 * sample. Eg, look in /proc/kallsyms. The addresses will change for each
 * kernel version and build.
 */
#define RANGE_START 0xffffffff81214b90
#define RANGE_END 0xffffffff81214cd0

struct bpf_map_def {
        unsigned int type;
        unsigned int key_size;
        unsigned int value_size;
        unsigned int max_entries;
};

static int (*probe_read)(void *dst, int size, void *src) =
        (void *)BPF_FUNC_probe_read;
static int (*get_smp_processor_id)(void) =
        (void *)BPF_FUNC_get_smp_processor_id;
static int (*perf_event_output)(void *, struct bpf_map_def *, int, void *,
        unsigned long) = (void *)BPF_FUNC_perf_event_output;

struct bpf_map_def SEC("maps") channel = {
        .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
        .key_size = sizeof(int),
        .value_size = sizeof(u32),
        .max_entries = __NR_CPUS__,
};

/* from /sys/kernel/debug/tracing/events/perf/perf_hrtimer/format */
struct perf_hrtimer_args {
        unsigned long long pad;
        struct pt_regs *regs;
        struct perf_event *event;
};

SEC("perf:perf_hrtimer")
int func(struct perf_hrtimer_args *ctx)
{
        struct pt_regs regs = {};

        probe_read(&regs, sizeof(regs), ctx->regs);
        if (regs.ip >= RANGE_START && regs.ip < RANGE_END) {
                perf_event_output(ctx, &channel, get_smp_processor_id(),
                                  &regs.ip, sizeof(regs.ip));
        }
        return 0;
}

char _license[] SEC("license") = "GPL";
int _version SEC("version") = LINUX_VERSION_CODE;
/************************* END ***************************/

>
> If I understand correctly, I can have a BPF script run 99 times per
> second using
>
> # perf -e cpu-clock/freq=99/ -e mybpf.c ...
>
> And in mybpf.c, attach a BPF script on the new tracepoint. Right?
>
> Also, since we already have timer:hrtimer_expire_entry, please provide
> some further information about why we need a new tracepoint.

timer:hrtimer_expire_entry fires for much more than just the perf
timer. The perf:perf_hrtimer tracepoint also has registers and perf
context as arguments, which can be used for profiling programs.

Thanks for the comments,

Brendan

Peter Zijlstra

Aug 3, 2016, 6:20:04 AM
On Wed, Aug 03, 2016 at 02:47:47AM +0000, Brendan Gregg wrote:
> When perf is performing hrtimer-based sampling, this tracepoint can be used
> by BPF to run additional logic on each sample. For example, BPF can fetch
> stack traces and frequency count them in kernel context, for an efficient
> profiler.

Urgh, no like.

And then next week we'll add a tracepoint to some other random pmu or
whatnot.

Brendan Gregg

Aug 3, 2016, 3:00:06 PM
I wouldn't want random pmu tracepoints either, but I can see how it
looks with a new tracepoint category of "perf:", and maybe I shouldn't
have called it that. Would it be better if the tracepoint was called
"timer:perf_hrtimer"?

As for pmu tracepoints: if I were to instrument it (although I wasn't
planning to), I'd put a tracepoint in perf_event_overflow() called
"perf:perf_overflow", with the same arguments. That could then be used
for all PMU overflow events, without needing to add specific
tracepoints. The "perf:" category would then have two tracepoints, and
would not need to grow. Does that option sound better than
"timer:perf_hrtimer"? (As in, keeping the category "perf" so that we
have a later option of adding a perf:perf_overflow tracepoint if need
be?)

It's really perf_hrtimer that we'll use daily, for CPU profiling with
in-kernel aggregations of instruction pointers and stack traces, and
other sampling needs. It's one tracepoint that will add much value,
and will mean that BPF programs can then be attached to kprobes,
uprobes, tracepoints, and timed samples.

Thanks for your consideration,

Brendan

Peter Zijlstra

Aug 4, 2016, 10:30:06 AM
On Wed, Aug 03, 2016 at 11:57:05AM -0700, Brendan Gregg wrote:

> As for pmu tracepoints: if I were to instrument it (although I wasn't
> planning to), I'd put a tracepoint in perf_event_overflow() called
> "perf:perf_overflow", with the same arguments. That could then be used
> for all PMU overflow events, without needing to add specific
> tracepoints.

Could we not teach BPF to replace event->overflow_handler and inject
itself there?

We don't currently have nice interfaces for doing that, but it should be
possible to do I think. We already have the indirect function call, so
injecting ourself there has 0 overhead.

Alexei Starovoitov

Aug 4, 2016, 9:50:05 PM
You're right, it all makes sense. I guess I was too lazy to look into
how to do it properly. Adding a tracepoint looked like a quick and
easy way to achieve the same.
As far as the API goes, the existing IOC_SET_BPF ioctl will probably do too.
Currently overflow_handler is set at event alloc time. If we start
changing it on the fly with atomic xchg(), afaik things shouldn't
break, since each overflow_handler is run to completion and doesn't
change global state, right?
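
For concreteness, a rough sketch of that attach path (names are
hypothetical, not existing kernel api; bpf_overflow_handler would be a
wrapper that runs the attached prog):

static void perf_event_set_bpf_handler(struct perf_event *event,
                                       struct bpf_prog *prog)
{
        perf_overflow_handler_t old;

        event->bpf_prog = prog;                 /* hypothetical field */
        /* each handler runs to completion, so an atomic swap is safe */
        old = xchg(&event->overflow_handler, bpf_overflow_handler);
        event->orig_overflow_handler = old;     /* hypothetical, for detach */
}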

Brendan Gregg

Aug 5, 2016, 12:50:05 AM
On Thu, Aug 4, 2016 at 6:43 PM, Alexei Starovoitov
<alexei.st...@gmail.com> wrote:
> On Thu, Aug 04, 2016 at 04:28:53PM +0200, Peter Zijlstra wrote:
>> On Wed, Aug 03, 2016 at 11:57:05AM -0700, Brendan Gregg wrote:
>>
>> > As for pmu tracepoints: if I were to instrument it (although I wasn't
>> > planning to), I'd put a tracepoint in perf_event_overflow() called
>> > "perf:perf_overflow", with the same arguments. That could then be used
>> > for all PMU overflow events, without needing to add specific
>> > tracepoints.
>>
>> Could we not teach BPF to replace event->overflow_handler and inject
>> itself there?
>>
>> We don't currently have nice interfaces for doing that, but it should be
>> possible to do I think. We already have the indirect function call, so
>> injecting ourself there has 0 overhead.

Sounds like a good idea, especially for things like struct
file_operations so that we can statically instrument file system
read/writes with zero non-enabled overhead, and not worry about high
frequency workloads (>10M events/sec).

These perf probes aren't high frequency, though, and the code is not
normally in use, so overhead should be much less of a concern.
Sampling at 999 Hertz * CPUs is as frequent as I'd go. And if the
tracepoint code is still adding a mem read, conditional, and branch,
then that's not many instructions, especially considering the normal
use case of these perf functions: creating records and writing to a
perf ring buffer, then picking that up in user space by perf, then
either processing it live or writing to perf.data, back to the file
system, etc. It would be hard to benchmark the effect of adding a few
instructions to that path (and any results may be more sensitive to
cache line placement than the instructions).

The perf:perf_hrtimer probe point is also reading state mid-way
through a function, so it's not quite as simple as wrapping the
function pointer. I do like that idea, though, but for things like
struct file_operations.

>
> You're right, it all makes sense. I guess I was too lazy to look into
> how to do it properly. Adding a tracepoint looked like a quick and
> easy way to achieve the same.
> As far as the API goes, the existing IOC_SET_BPF ioctl will probably do too.
> Currently overflow_handler is set at event alloc time. If we start
> changing it on the fly with atomic xchg(), afaik things shouldn't
> break, since each overflow_handler is run to completion and doesn't
> change global state, right?
>

How would it be implemented? I was thinking of adding explicit wrappers, eg:

static ssize_t
__ext4_file_write_iter_tracepoint(struct kiocb *iocb, struct iov_iter *from)
{
        trace_ext4_write_iter(iocb, from);
        return ext4_file_write_iter(iocb, from);
}

Then switching in __ext4_file_write_iter_tracepoint() as needed.

I was wondering about doing this dynamically, but wouldn't that then
create another problem of maintenance -- we'd need to decorate such
code with a comment, at least, to say "this function is exposed as a
tracepoint".

If a dynamic approach is still desired, and the goal is zero
non-enabled overhead, then I'd also wonder if we could leverage
kprobes to do this. Imagine walking a file_operations to find the
addresses, and then kprobe-ing the addresses. Not the same (or
probably as efficient) as setting the function pointer, but, using
existing kprobes.

Brendan

Alexei Starovoitov

Aug 5, 2016, 1:30:05 AM
tracepoints are actually zero overhead already via static-key mechanism.
I don't think Peter's objection for the tracepoint was due to overhead.

> The perf:perf_hrtimer probe point is also reading state mid-way
> through a function, so it's not quite as simple as wrapping the
> function pointer. I do like that idea, though, but for things like
> struct file_operations.
>
> >
> > You're right, it all makes sense. I guess I was too lazy to look into
> > how to do it properly. Adding a tracepoint looked like a quick and
> > easy way to achieve the same.
> > As far as the API goes, the existing IOC_SET_BPF ioctl will probably do too.
> > Currently overflow_handler is set at event alloc time. If we start
> > changing it on the fly with atomic xchg(), afaik things shouldn't
> > break, since each overflow_handler is run to completion and doesn't
> > change global state, right?
> >
>
> How would it be implemented? I was thinking of adding explicit wrappers, eg:

instead of adding a tracepoint to perf_swevent_hrtimer we can replace
overflow_handler for that particular event with some form of bpf wrapper.
(probably new bpf program type). Then not only periodic events
will be triggering bpf prog, but pmu events as well.
So instead of normal __perf_event_output() writing into ringbuffer,
a bpf prog will be called that can optionally write into different
rb via bpf_perf_event_output. The question is what to pass into the
program to make the most use out of it. 'struct pt_regs' is a done deal,
but perf_sample_data we cannot pass as-is, since it's kernel internal.
Probably something similar to __sk_buff mirror would be needed.
Another nice benefit of doing this via overflow_handler instead of a
tracepoint is that the exclude_idle, exclude_user, exclude_kernel flags
of the perf event will all magically work, and the program will be event
specific. So two parallel 'perf record'-like sampling sessions won't
conflict.
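
Such a program might look roughly like this (sketch only; the program
type and its context struct don't exist yet, so these names are made up):

struct bpf_perf_event_ctx {             /* hypothetical mirror struct */
        struct pt_regs regs;
        __u64 sample_period;
};

SEC("perf_event")
int on_sample(struct bpf_perf_event_ctx *ctx)
{
        u64 ip = ctx->regs.ip;

        /* filter on the sampled IP, as sampleip_range.c does above */
        if (ip >= RANGE_START && ip < RANGE_END)
                bpf_perf_event_output(ctx, &channel, BPF_F_CURRENT_CPU,
                                      &ip, sizeof(ip));
        return 0;
}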

Peter Zijlstra

Aug 5, 2016, 7:00:05 AM
On Thu, Aug 04, 2016 at 10:24:06PM -0700, Alexei Starovoitov wrote:
> tracepoints are actually zero overhead already via static-key mechanism.
> I don't think Peter's objection for the tracepoint was due to overhead.

Almost 0, they still have some I$ footprint, but yes. My main worry is
that we can feed tracepoints into perf, so having tracepoints in perf is
tricky.

I also don't much like this tracepoint being specific to the hrtimer
bits, I can well imagine people wanting to do the same thing for
hardware based samples or whatnot.

> > The perf:perf_hrtimer probe point is also reading state mid-way
> > through a function, so it's not quite as simple as wrapping the
> > function pointer. I do like that idea, though, but for things like
> > struct file_operations.

So what additional state do you need?

> > > Currently overflow_handler is set at event alloc time. If we start
> > > changing it on the fly with atomic xchg(), afaik things shouldn't
> > > break, since each overflow_handler is run to completion and doesn't
> > > change global state, right?

Yes, or even a simple WRITE_ONCE() to replace it, as long as we make
sure to use a READ_ONCE() to load the pointer.

As long as we're sure to limit this poking to a single user it's fairly
simple to get right. The moment there can be concurrency a lot of fail
can happen.
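
I.e. something like (sketch):

        perf_overflow_handler_t handler;

        /* attach side: publish the new handler */
        WRITE_ONCE(event->overflow_handler, bpf_overflow_handler);

        /* overflow path: load the pointer once, then call through it */
        handler = READ_ONCE(event->overflow_handler);
        handler(event, data, regs);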

> instead of adding a tracepoint to perf_swevent_hrtimer we can replace
> overflow_handler for that particular event with some form of bpf wrapper.
> (probably new bpf program type). Then not only periodic events
> will be triggering bpf prog, but pmu events as well.

Exactly.

> So instead of normal __perf_event_output() writing into ringbuffer,
> a bpf prog will be called that can optionally write into different
> rb via bpf_perf_event_output.

It could even chain and call into the original function once it's done
and have both outputs.

> The question is what to pass into the
> program to make the most use out of it. 'struct pt_regs' is a done deal,
> but perf_sample_data we cannot pass as-is, since it's kernel internal.

Urgh, does it have to be stable API? Can't we simply rely on the kernel
headers to provide the right structure definition?

Brendan Gregg

Aug 5, 2016, 1:30:05 PM
On Fri, Aug 5, 2016 at 3:52 AM, Peter Zijlstra <pet...@infradead.org> wrote:
> On Thu, Aug 04, 2016 at 10:24:06PM -0700, Alexei Starovoitov wrote:
>> tracepoints are actually zero overhead already via static-key mechanism.
>> I don't think Peter's objection for the tracepoint was due to overhead.
>
> Almost 0, they still have some I$ footprint, but yes. My main worry is
> that we can feed tracepoints into perf, so having tracepoints in perf is
> tricky.

Coincidentally, I$ footprint was my most recent use case for needing
this: I have an I$ busting workload, and want to profile
instructions at a very high rate to get a breakdown of I$ population.
(Normally I'd use I$ miss overflow, but none of our Linux systems have
PMCs: cloud.)

> I also don't much like this tracepoint being specific to the hrtimer
> bits, I can well imagine people wanting to do the same thing for
> hardware based samples or whatnot.

Sure, which is why I thought we'd have two in a perf category. I'm all
for PMC events, even though we can't currently use them!

>
>> > The perf:perf_hrtimer probe point is also reading state mid-way
>> > through a function, so it's not quite as simple as wrapping the
>> > function pointer. I do like that idea, though, but for things like
>> > struct file_operations.
>
> So what additional state do you need?

I was pulling in regs after get_irq_regs(), struct perf_event *event
after it's populated. Not that hard to duplicate. Just noting it
didn't map directly to the function entry.

I wanted perf_event just for event->ctx->task->pid, so that a BPF
program can differentiate between its samples and other concurrent
sessions.

(I was thinking of changing my patch to expose pid_t instead of
perf_event, since I was noticing it didn't add many instructions.)

[...]
>> instead of adding a tracepoint to perf_swevent_hrtimer we can replace
>> overflow_handler for that particular event with some form of bpf wrapper.
>> (probably new bpf program type). Then not only periodic events
>> will be triggering bpf prog, but pmu events as well.
>
> Exactly.

Although the timer use case is a bit different, and is via
hwc->hrtimer.function = perf_swevent_hrtimer.

[...]
>> The question is what to pass into the
>> program to make the most use out of it. 'struct pt_regs' is a done deal,
>> but perf_sample_data we cannot pass as-is, since it's kernel internal.
>
> Urgh, does it have to be stable API? Can't we simply rely on the kernel
> headers to provide the right structure definition?

For timer it can be: struct pt_regs, pid_t.

So that would restrict your BPF program to one timer, since if you had
two (from one pid) you couldn't tell them apart. But I'm not sure of a
use case for two in-kernel timers. If there were, we could also add
struct perf_event_attr, which has enough info to tell things apart,
and is already exposed to user space.

I haven't looked into the PMU arguments, but perhaps that could be:
struct pt_regs, pid_t, struct perf_event_attr.

Thanks,

Brendan

Alexei Starovoitov

Aug 5, 2016, 3:50:04 PM
On Fri, Aug 05, 2016 at 12:52:09PM +0200, Peter Zijlstra wrote:
> > > > Currently overflow_handler is set at event alloc time. If we start
> > > > changing it on the fly with atomic xchg(), afaik things shouldn't
> > > > break, since each overflow_handler is run to completion and doesn't
> > > > change global state, right?
>
> Yes, or even a simple WRITE_ONCE() to replace it, as long as we make
> sure to use a READ_ONCE() to load the pointer.
>
> As long as we're sure to limit this poking to a single user it's fairly
> simple to get right. The moment there can be concurrency a lot of fail
> can happen.

agreed.

> > So instead of normal __perf_event_output() writing into ringbuffer,
> > a bpf prog will be called that can optionally write into different
> > rb via bpf_perf_event_output.
>
> It could even chain and call into the original function once it's done
> and have both outputs.

interesting idea. makes sense.
Also thinking about concurrency and the need to remember the original
handler somewhere, would it be a cleaner API to add a bit to perf_event_attr
and use attr.config1 as bpf_fd?
Then perf_event_open at event creation time will use bpf prog as
overflow_handler. That solves concurrency concerns and potential semantic
issues if we go with the ioctl() approach.
Like if we perf_event_open() an event for a task, then bpf attach to it,
what children task and corresponding inherited events suppose to do?
Inherit overflow_handler, right? but then deatch of bpf in the parent
suppose to clear it in inherited events as well. A bit complicated.
I guess we can define it that way.
Just seems easier to do bpf attach at perf_event_open time only.

> > The question is what to pass into the
> > program to make the most use out of it. 'struct pt_regs' is a done deal,
> > but perf_sample_data we cannot pass as-is, since it's kernel internal.
>
> Urgh, does it have to be stable API? Can't we simply rely on the kernel
> headers to provide the right structure definition?

yes we can. The concern is about assumptions people will make about
perf_sample_data and the speed of access to it. From bpf program point
of view the pointer to perf_sample_data will be an opaque unsafe pointer,
so any access to fields would have to be done via bpf_probe_read which
has non-trivial overhead.
If we go with the uapi mirror of perf_sample_data approach, it will be
fast, since the mirror is not an actual struct. Like the 'struct __sk_buff' we
have in uapi/linux/bpf.h is a meta structure. It's not allocated anywhere
and no fields are copied. When bpf program does 'skb->vlan_present'
the verifier rewrites it at load time into corresponding access to
kernel internal 'struct sk_buff' fields with bitmask, shifts and such.
For this case we can define something like
struct bpf_perf_sample_data {
        __u64 period;
};
then bpf prog will only be able to access that single field, which the
verifier will translate into 'data->period' where data is 'struct perf_sample_data *'.
Later we can add other fields if necessary. The kernel is free to mess
around with perf_sample_data whichever way without impacting bpf progs.
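
From the program side that's then plain field access (sketch, assuming
the meta structure above):

SEC("perf_event")
int prog(struct bpf_perf_sample_data *data)
{
        /* the verifier rewrites this load at program load time into an
         * access to the kernel-internal struct perf_sample_data */
        if (data->period < 100000)
                return 0;
        return 1;
}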

Peter Zijlstra

Aug 8, 2016, 6:00:08 AM
On Fri, Aug 05, 2016 at 10:22:08AM -0700, Brendan Gregg wrote:

> (Normally I'd use I$ miss overflow, but none of our Linux systems have
> PMCs: cloud.)

I think I had better not comment on that ;-)

> >> > The perf:perf_hrtimer probe point is also reading state mid-way
> >> > through a function, so it's not quite as simple as wrapping the
> >> > function pointer. I do like that idea, though, but for things like
> >> > struct file_operations.
> >
> > So what additional state do you need?
>
> I was pulling in regs after get_irq_regs(), struct perf_event *event
> after it's populated. Not that hard to duplicate. Just noting it
> didn't map directly to the function entry.

Right, both of which are available to the overflow handler.

> I wanted perf_event just for event->ctx->task->pid, so that a BPF
> program can differentiate between its samples and other concurrent
> sessions.
>
> (I was thinking of changing my patch to expose pid_t instead of
> perf_event, since I was noticing it didn't add many instructions.)

Slightly confused, event->ctx->task == current, no? We flip that pointer
when we flip the contexts.

At which point, it should be the same as SAMPLE_TID.

!?

> [...]
> >> instead of adding a tracepoint to perf_swevent_hrtimer we can replace
> >> overflow_handler for that particular event with some form of bpf wrapper.
> >> (probably new bpf program type). Then not only periodic events
> >> will be triggering bpf prog, but pmu events as well.
> >
> > Exactly.
>
> Although the timer use case is a bit different, and is via
> hwc->hrtimer.function = perf_swevent_hrtimer.

Still not entirely sure why you could not hook into
event->overflow_handler.

Peter Zijlstra

Aug 8, 2016, 6:10:05 AM
On Fri, Aug 05, 2016 at 12:45:07PM -0700, Alexei Starovoitov wrote:

> Also thinking about concurrency and the need to remember the original
> handler somewhere, would it be a cleaner API to add a bit to perf_event_attr
> and use attr.config1 as bpf_fd?

attr.config[12] are in use already, typically uncore events need them.
We cannot rely on those bits being unused.

> Then perf_event_open at event creation time will use bpf prog as
> overflow_handler. That solves concurrency concerns and potential semantic
> issues if we go with the ioctl() approach.

> Like if we perf_event_open() an event for a task, then bpf attach to it,
> what are the children tasks and corresponding inherited events supposed to do?
> Inherit the overflow_handler, right? But then detach of bpf in the parent
> is supposed to clear it in inherited events as well. A bit complicated.
> I guess we can define it that way.
> Just seems easier to do bpf attach at perf_event_open time only.

Which is why I would've liked BPF to create its own events, instead of
userspace passing them in through this array.

Because now you have a chicken'n'egg issue in that you want BPF to use
the overflow handler but need BPF running before you have an actual
handler to link to.

> > Urgh, does it have to be stable API? Can't we simply rely on the kernel
> > headers to provide the right structure definition?
>
> yes we can. The concern is about assumptions people will make about
> perf_sample_data and the speed of access to it. From bpf program point
> of view the pointer to perf_sample_data will be an opaque unsafe pointer,
> so any access to fields would have to be done via bpf_probe_read which
> has non-trivial overhead.
> If we go with the uapi mirror of perf_sample_data approach, it will be
> fast, since the mirror is not an actual struct. Like the 'struct __sk_buff' we
> have in uapi/linux/bpf.h is a meta structure. It's not allocated anywhere
> and no fields are copied. When bpf program does 'skb->vlan_present'
> the verifier rewrites it at load time into corresponding access to
> kernel internal 'struct sk_buff' fields with bitmask, shifts and such.
> For this case we can define something like
> struct bpf_perf_sample_data {
>         __u64 period;
> };
> then bpf prog will only be able to access that single field, which the
> verifier will translate into 'data->period' where data is 'struct perf_sample_data *'.
> Later we can add other fields if necessary. The kernel is free to mess
> around with perf_sample_data whichever way without impacting bpf progs.

Hmm, I was not aware of that. Should be doable indeed.