[PATCH 4/4] perf core: Add backward attribute to perf event

Wang Nan

Mar 28, 2016, 2:50:06 AM

This patch introduces a 'write_backward' bit to perf_event_attr, which
controls the direction of a ring buffer. When it is set, the corresponding
ring buffer is written from end to beginning. This feature is designed to
support reading from an overwritable ring buffer.

A ring buffer can be created by mapping a perf event fd. The kernel puts
event records into the ring buffer, and user programs like perf fetch them
from the address returned by mmap(). To prevent racing between the kernel
and perf, they communicate with each other through 'head' and 'tail'
pointers. The kernel maintains the 'head' pointer and points it to the
next free area (the tail of the last record). Perf maintains the 'tail'
pointer and points it to the tail of the last consumed record (a record
that has already been fetched). The kernel determines the available space
in a ring buffer using these two pointers, which prevents it from
overwriting unfetched records.
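
The space check described above can be modeled in a few lines of userspace
C. This is only a simplified sketch, not the kernel's implementation: the
names rb_used() and rb_can_write() are made up for illustration, and it
assumes 'head' and 'tail' are free-running 64-bit byte counters that are
wrapped only when the buffer is actually accessed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bytes the writer has produced that the reader has not consumed yet. */
static uint64_t rb_used(uint64_t head, uint64_t tail)
{
	return head - tail;
}

/*
 * A record of rec_len bytes may be written only if it fits into the
 * space that is not occupied by unfetched records.
 */
static bool rb_can_write(uint64_t head, uint64_t tail,
			 uint64_t size, uint64_t rec_len)
{
	return rec_len <= size - rb_used(head, tail);
}
```

With a 64-byte buffer, head at 60 and tail at 0, an 8-byte record is
rejected; once the reader advances tail to 8, the same record fits.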

By mapping without 'PROT_WRITE', an overwritable ring buffer is created.
Unlike a normal ring buffer, perf is unable to maintain the 'tail'
pointer because writing is forbidden. Therefore, for this type of ring
buffer, the kernel overwrites old records unconditionally and works like a
flight recorder. This feature would be useful if reading from an
overwritable ring buffer were as easy as reading from a normal ring
buffer. However, there's an obscure problem.

The following figure demonstrates the state of an overwritable ring
buffer which is nearly full. In this figure, the 'head' pointer points
to the end of the last record, and a long record 'E' is pending. For a
normal ring buffer, the 'tail' pointer would have pointed to position (X),
so the kernel would know there's no more space in the ring buffer.
However, for an overwritable ring buffer, the kernel ignores the 'tail'
pointer.

(X)                              head
 .                                |
 .                                V
 +------+-------+----------+------+---+
 |A....A|B.....B|C........C|D....D|   |
 +------+-------+----------+------+---+

After writing record 'E', record 'A' is overwritten.

   head
    |
    V
 +--+---+-------+----------+------+---+
 |.E|..A|B.....B|C........C|D....D|E..|
 +--+---+-------+----------+------+---+

Now perf decides to read from this ring buffer. However, neither of the
two natural positions, 'head' and the start of this ring buffer,
points to the head of a record. Even though perf can read the full ring
buffer, it is unable to find a position to start decoding.

The first attempt to solve this problem, AFAIK, can be found in
[1]. It makes the kernel maintain the 'tail' pointer and update it when
the ring buffer is half full. However, this approach introduces overhead
to the fast path. Test results show a 1% overhead [2]. In addition, this
method utilizes no more than 50% of the records.

Another attempt can be found in [3], which allows putting the size of
an event at the end of each record. This approach allows perf to find
records in a backward manner from the 'head' pointer by reading the size
of a record from its tail. However, because of the alignment requirement,
it needs 8 bytes to record the size of a record, which is a huge waste.
Its performance is also not good, because more data needs to be written.
This approach also introduces some extra branch instructions to the fast
path.

'write_backward' is a better solution to this problem.

The following figure demonstrates the state of the overwritable ring
buffer when 'write_backward' is set, before overwriting:

    head
     |
     V
 +---+------+----------+-------+------+
 |   |D....D|C........C|B.....B|A....A|
 +---+------+----------+-------+------+

and after overwriting:
                                  head
                                   |
                                   V
 +---+------+----------+-------+---+--+
 |..E|D....D|C........C|B.....B|A..|E.|
 +---+------+----------+-------+---+--+

In each situation, 'head' points to the beginning of the newest record.
From this record, perf can iterate over the full ring buffer, fetching
as many records as possible one by one.
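
The walk from 'head' can be illustrated with a small userspace model. This
is a sketch under stated assumptions, not code from this patch: struct ring
and the ring_*() helpers are hypothetical, the buffer is a plain 64-byte
array, and the record header only mirrors the field layout of
struct perf_event_header (type, misc, size).

```c
#include <stddef.h>
#include <stdint.h>

#define RB_SIZE 64u			/* must be a power of two */
#define RB_MASK (RB_SIZE - 1)

struct rec_header {			/* laid out like perf_event_header */
	uint32_t type;
	uint16_t misc;
	uint16_t size;			/* whole record, header included */
};

struct ring {
	uint8_t  data[RB_SIZE];
	uint64_t head;			/* start of newest record, only decreases */
};

/* Copy len bytes out of the ring starting at off, handling wrap-around. */
static void ring_copy_out(const struct ring *rb, uint64_t off,
			  void *dst, size_t len)
{
	uint8_t *out = dst;

	for (size_t i = 0; i < len; i++)
		out[i] = rb->data[(off + i) & RB_MASK];
}

/* Writer: move head down by the record size, then store the record there. */
static void ring_write_backward(struct ring *rb, uint32_t type, uint16_t size)
{
	struct rec_header h = { .type = type, .misc = 0, .size = size };
	const uint8_t *src = (const uint8_t *)&h;

	rb->head -= size;
	for (size_t i = 0; i < size; i++)
		rb->data[(rb->head + i) & RB_MASK] = i < sizeof(h) ? src[i] : 0;
}

/*
 * Reader: start at head (always the start of a record) and step towards
 * older records using the size found in each header. Stops at unwritten
 * space or at the first record that no longer fits in the buffer, which
 * means it was partially overwritten. Types are stored newest first.
 */
static int ring_read_backward(const struct ring *rb, uint32_t *types, int max)
{
	uint64_t off = rb->head;
	uint64_t walked = 0;
	int n = 0;

	while (n < max) {
		struct rec_header h;

		ring_copy_out(rb, off, &h, sizeof(h));
		if (h.size < sizeof(h) || walked + h.size > RB_SIZE)
			break;
		types[n++] = h.type;
		off += h.size;
		walked += h.size;
	}
	return n;
}
```

Writing records A, B, C (16 bytes each) and D (8 bytes) and then a 16-byte
record E reproduces the two figures above: the reader first finds D, C, B,
A, and after the overwrite finds E, D, C, B and stops at the damaged A.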

The only limitation that needs to be considered is back-to-back reading.
Because user programs are non-deterministic, it is impossible to ensure
that the ring buffer stays stable during reading. Consider an extreme
situation: perf is scheduled out after reading record 'D', then a burst
of events comes and eats up the whole ring buffer (one or multiple
rounds), but the 'head' pointer happens to be at the same position when
perf comes back. Continuing to read after 'D' is now incorrect.

To prevent this problem, we need to find a way to ensure the ring buffer
is stable during reading. ioctl(PERF_EVENT_IOC_PAUSE_OUTPUT) is
suggested because its overhead is lower than
ioctl(PERF_EVENT_IOC_ENABLE).
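
From userspace, pausing around a read could look roughly like the sketch
below. It assumes headers that already define PERF_EVENT_IOC_PAUSE_OUTPUT
and that a nonzero argument pauses output while zero resumes it;
read_paused() and its 'drain' callback are hypothetical names used only
for illustration.

```c
#include <stddef.h>
#include <linux/perf_event.h>
#include <sys/ioctl.h>

/*
 * Pause output on perf_fd, let the caller-supplied drain() routine read
 * the mapped ring buffer, then resume output. Returns -1 if pausing
 * fails (drain() is not called in that case), otherwise the result of
 * the resume ioctl.
 */
static int read_paused(int perf_fd, void (*drain)(void *ctx), void *ctx)
{
	if (ioctl(perf_fd, PERF_EVENT_IOC_PAUSE_OUTPUT, 1) < 0)
		return -1;

	drain(ctx);

	return ioctl(perf_fd, PERF_EVENT_IOC_PAUSE_OUTPUT, 0);
}
```

Keeping the pause and resume in one helper makes it harder to leave the
buffer paused by accident after an early return in the reading code.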

This patch utilizes the event's default overflow_handler introduced
previously. perf_event_output_backward() is created as the default
overflow handler for backward ring buffers. To avoid extra overhead on
the fast path, the original perf_event_output() becomes
__perf_event_output() and is marked '__always_inline'. In theory, no
extra overhead is introduced to the fast path.
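
The pattern relied on here can be shown in isolation. The following is a
generic userspace sketch with placeholder names (output_common(),
begin_forward() and so on are not kernel functions): a worker that is
always inlined takes the 'begin' routine as a function pointer, and each
thin wrapper passes a compile-time constant, so the compiler is free to
turn the indirect call into a direct one in every wrapper.

```c
typedef int (*begin_fn)(int size);

/* Stand-ins for the two direction-specific 'begin' routines. */
static int begin_forward(int size)  { return size; }
static int begin_backward(int size) { return -size; }

/* Shared body; forced inline so 'begin' is a known constant per caller. */
static inline __attribute__((always_inline))
int output_common(int size, begin_fn begin)
{
	/* work common to both directions would go here */
	return begin(size);
}

/* Thin wrappers: each one gets its own specialized copy of the body. */
int output_forward(int size)  { return output_common(size, begin_forward); }
int output_backward(int size) { return output_common(size, begin_backward); }
```

The cost is code size, since the shared body is duplicated in every
wrapper, in exchange for no direction test on the hot path.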

Performance result:

Call 'close(-1)' 3000000 times and use gettimeofday() to measure the
duration. Use 'perf record -o /dev/null -e raw_syscalls:*' to capture
system calls. In ns.

Testing environment:

 CPU    : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
 Kernel : v4.5.0

              MEAN         STDVAR
 BASE   800214.950       2853.083
 PRE1  2253846.700       9997.014
 PRE2  2257495.540       8516.293
 POST  2250896.100       8933.921

Where 'BASE' is pure performance without capturing, 'PRE1' is the test
result of the pure 'v4.5.0' kernel, 'PRE2' is the test result before this
patch, and 'POST' is the test result after this patch. See [4] for the
detailed experimental setup.

Considering the stdvar, this patch doesn't introduce performance
overhead to the fast path.

[1] http://lkml.iu.edu/hypermail/linux/kernel/1304.1/04584.html
[2] http://lkml.iu.edu/hypermail/linux/kernel/1307.1/00535.html
[3] http://lkml.iu.edu/hypermail/linux/kernel/1512.0/01265.html
[4] http://lkml.kernel.org/g/56F89DCD...@huawei.com

Signed-off-by: Wang Nan <wang...@huawei.com>
Cc: He Kuang <hek...@huawei.com>
Cc: Alexei Starovoitov <a...@kernel.org>
Cc: Arnaldo Carvalho de Melo <ac...@redhat.com>
Cc: Brendan Gregg <brendan...@gmail.com>
Cc: Jiri Olsa <jo...@kernel.org>
Cc: Masami Hiramatsu <masami.hi...@hitachi.com>
Cc: Namhyung Kim <namh...@kernel.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Zefan Li <liz...@huawei.com>
Cc: pi3o...@163.com
---
include/linux/perf_event.h | 28 +++++++++++++++++++++---
include/uapi/linux/perf_event.h | 3 ++-
kernel/events/core.c | 48 ++++++++++++++++++++++++++++++++++++-----
kernel/events/ring_buffer.c | 14 ++++++++++++
4 files changed, 84 insertions(+), 9 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 4065ca2..0cc36ad 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -834,14 +834,24 @@ extern int perf_event_overflow(struct perf_event *event,
struct perf_sample_data *data,
struct pt_regs *regs);

+extern void perf_event_output_forward(struct perf_event *event,
+ struct perf_sample_data *data,
+ struct pt_regs *regs);
+extern void perf_event_output_backward(struct perf_event *event,
+ struct perf_sample_data *data,
+ struct pt_regs *regs);
extern void perf_event_output(struct perf_event *event,
- struct perf_sample_data *data,
- struct pt_regs *regs);
+ struct perf_sample_data *data,
+ struct pt_regs *regs);

static inline bool
is_default_overflow_handler(struct perf_event *event)
{
- return (event->overflow_handler == perf_event_output);
+ if (likely(event->overflow_handler == perf_event_output_forward))
+ return true;
+ if (unlikely(event->overflow_handler == perf_event_output_backward))
+ return true;
+ return false;
}

extern void
@@ -1042,8 +1052,20 @@ static inline bool has_aux(struct perf_event *event)
return event->pmu->setup_aux;
}

+static inline bool is_write_backward(struct perf_event *event)
+{
+ return !!event->attr.write_backward;
+}
+
extern int perf_output_begin(struct perf_output_handle *handle,
struct perf_event *event, unsigned int size);
+extern int perf_output_begin_forward(struct perf_output_handle *handle,
+ struct perf_event *event,
+ unsigned int size);
+extern int perf_output_begin_backward(struct perf_output_handle *handle,
+ struct perf_event *event,
+ unsigned int size);
+
extern void perf_output_end(struct perf_output_handle *handle);
extern unsigned int perf_output_copy(struct perf_output_handle *handle,
const void *buf, unsigned int len);
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index a3c1903..43fc8d2 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -340,7 +340,8 @@ struct perf_event_attr {
comm_exec : 1, /* flag comm events that are due to an exec */
use_clockid : 1, /* use @clockid for time fields */
context_switch : 1, /* context switch data */
- __reserved_1 : 37;
+ write_backward : 1, /* Write ring buffer from end to beginning */
+ __reserved_1 : 36;

union {
__u32 wakeup_events; /* wakeup every n events */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3bd4b2b..41a2614 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5641,9 +5641,13 @@ void perf_prepare_sample(struct perf_event_header *header,
}
}

-void perf_event_output(struct perf_event *event,
- struct perf_sample_data *data,
- struct pt_regs *regs)
+static void __always_inline
+__perf_event_output(struct perf_event *event,
+ struct perf_sample_data *data,
+ struct pt_regs *regs,
+ int (*output_begin)(struct perf_output_handle *,
+ struct perf_event *,
+ unsigned int))
{
struct perf_output_handle handle;
struct perf_event_header header;
@@ -5653,7 +5657,7 @@ void perf_event_output(struct perf_event *event,

perf_prepare_sample(&header, data, event, regs);

- if (perf_output_begin(&handle, event, header.size))
+ if (output_begin(&handle, event, header.size))
goto exit;

perf_output_sample(&handle, &header, data, event);
@@ -5664,6 +5668,30 @@ exit:
rcu_read_unlock();
}

+void
+perf_event_output_forward(struct perf_event *event,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
+{
+ __perf_event_output(event, data, regs, perf_output_begin_forward);
+}
+
+void
+perf_event_output_backward(struct perf_event *event,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
+{
+ __perf_event_output(event, data, regs, perf_output_begin_backward);
+}
+
+void
+perf_event_output(struct perf_event *event,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
+{
+ __perf_event_output(event, data, regs, perf_output_begin);
+}
+
/*
* read event_id
*/
@@ -8017,8 +8045,11 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
if (overflow_handler) {
event->overflow_handler = overflow_handler;
event->overflow_handler_context = context;
+ } else if (is_write_backward(event)){
+ event->overflow_handler = perf_event_output_backward;
+ event->overflow_handler_context = NULL;
} else {
- event->overflow_handler = perf_event_output;
+ event->overflow_handler = perf_event_output_forward;
event->overflow_handler_context = NULL;
}

@@ -8253,6 +8284,13 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
goto out;

/*
+ * Either writing ring buffer from beginning or from end.
+ * Mixing is not allowed.
+ */
+ if (is_write_backward(output_event) != is_write_backward(event))
+ goto out;
+
+ /*
* If both events generate aux data, they must be on the same PMU
*/
if (has_aux(event) && has_aux(output_event) &&
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index b2c7c15..8e6c4b5 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -230,9 +230,23 @@ out:
return -ENOSPC;
}

+int perf_output_begin_forward(struct perf_output_handle *handle,
+ struct perf_event *event, unsigned int size)
+{
+ return __perf_output_begin(handle, event, size, false);
+}
+
+int perf_output_begin_backward(struct perf_output_handle *handle,
+ struct perf_event *event, unsigned int size)
+{
+ return __perf_output_begin(handle, event, size, true);
+}
+
int perf_output_begin(struct perf_output_handle *handle,
struct perf_event *event, unsigned int size)
{
+ if (unlikely(is_write_backward(event)))
+ return __perf_output_begin(handle, event, size, true);
return __perf_output_begin(handle, event, size, false);
}

--
1.8.3.4

Wang Nan

Mar 28, 2016, 6:20:07 AM

Signed-off-by: Wang Nan <wang...@huawei.com>
---
man2/perf_event_open.2 | 57 ++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 55 insertions(+), 2 deletions(-)

diff --git a/man2/perf_event_open.2 b/man2/perf_event_open.2
index b232cba..942a410 100644
--- a/man2/perf_event_open.2
+++ b/man2/perf_event_open.2
@@ -234,8 +234,10 @@ struct perf_event_attr {
mmap2 : 1, /* include mmap with inode data */
comm_exec : 1, /* flag comm events that are due to exec */
use_clockid : 1, /* use clockid for time fields */
+ context_switch : 1, /* context switch data */
+ write_backward : 1, /* Write ring buffer from end to beginning */
- __reserved_1 : 38;
+ __reserved_1 : 36;

union {
__u32 wakeup_events; /* wakeup every n events */

@@ -1105,6 +1107,30 @@ field.
This can make it easier to correlate perf sample times with
timestamps generated by other tools.
.TP
+.IR "write_backward" " (since Linux 4.6)"
+.\" commit ? (http://lkml.kernel.org/g/1459147292-239310-5-g...@huawei.com)
+This makes the resulting event use a backward ring-buffer, which
+writes samples from the end of the ring-buffer.
+
+It is not allowed to connect events with backward and forward
+ring-buffer settings together using
+.B PERF_EVENT_IOC_SET_OUTPUT.
+
+Backward ring-buffer is useful when the ring-buffer is overwritable
+(created by readonly
+.BR mmap (2)
+). In this case,
+.IR data_tail
+is useless,
+.IR data_head
+points to the head of the most recent sample in a backward
+ring-buffer. It is easy to iterate over the whole ring-buffer by reading
+samples one by one because size of a sample can be found from decoding
+its header. In contrast, in a forward overwritable ring-buffer, the only
+information is the end of the most recent sample which is pointed by
+.IR data_head,
+but the size of a sample can't be determined from the end of it.
+.TP
.IR "wakeup_events" ", " "wakeup_watermark"
This union sets how many samples
.RI ( wakeup_events )
@@ -1634,7 +1660,9 @@ And vice versa:
.TP
.I data_head
This points to the head of the data section.
-The value continuously increases, it does not wrap.
+The value continuously increases (or decrease if
+.IR write_backward
+is set), it does not wrap.
The value needs to be manually wrapped by the size of the mmap buffer
before accessing the samples.

@@ -2581,6 +2609,24 @@ Starting with Linux 3.18,
.B POLL_HUP
is indicated if the event being monitored is attached to a different
process and that process exits.
+.SS Reading from overwritable ring-buffer
+Reader is unable to update
+.IR data_tail
+if the mapping is not
+.BR PROT_WRITE .
+In this case, kernel will overwrite data without considering whether
+they are read or not, so ring-buffer is overwritable and
+behaves like a flight recorder. To read from an overwritable
+ring-buffer, setting
+.IR write_backward
+is suggested, or it would be hard to find a proper position to start
+decoding. In addition, ring-buffer should be paused before reading
+through
+.BR ioctl (2)
+with
+.B PERF_EVENT_IOC_PAUSE_OUTPUT
+to avoid racing between kernel and reader. Ring-buffer should be resumed
+after finish reading.
.SS rdpmc instruction
Starting with Linux 3.4 on x86, you can use the
.\" commit c7206205d00ab375839bd6c7ddb247d600693c09
@@ -2693,6 +2739,13 @@ The file descriptors must all be on the same CPU.

The argument specifies the desired file descriptor, or \-1 if
output should be ignored.
+
+Two events with different
+.IR write_backward
+settings are not allowed to be connected together using
+.B PERF_EVENT_IOC_SET_OUTPUT.
+.B EINVAL
+is returned in this case.
.TP
.BR PERF_EVENT_IOC_SET_FILTER " (since Linux 2.6.33)"
.\" commit 6fb2915df7f0747d9044da9dbff5b46dc2e20830
--
1.8.3.4

Alexei Starovoitov

Mar 28, 2016, 8:30:06 PM

Very useful feature. Looks good.
Acked-by: Alexei Starovoitov <a...@kernel.org>

Wangnan (F)

Mar 28, 2016, 10:10:06 PM

On 2016/3/28 14:41, Wang Nan wrote:

[SNIP]

>
> To prevent this problem, we need to find a way to ensure the ring buffer
> is stable during reading. ioctl(PERF_EVENT_IOC_PAUSE_OUTPUT) is
> suggested because its overhead is lower than
> ioctl(PERF_EVENT_IOC_ENABLE).
>

Add comment:

By carefully verifying the 'header' pointer, the reader can avoid pausing
the ring buffer. For example:

/* A union of all possible events */
union perf_event event;

p = head = perf_mmap__read_head();
while (true) {
        /* copy header of next event */
        fetch(&event.header, p, sizeof(event.header));

        /* read 'head' pointer */
        head = perf_mmap__read_head();

        /* check overwritten: is the header good? */
        if (!verify(sizeof(event.header), p, head))
                break;

        /* copy the whole event */
        fetch(&event, p, event.header.size);

        /* read 'head' pointer again */
        head = perf_mmap__read_head();

        /* is the whole event good? */
        if (!verify(event.header.size, p, head))
                break;
        p += event.header.size;
}

However, the overhead is high because:

a) In-place decoding is unsafe. Copy-verifying-decode is required.
b) Fetching 'head' pointer requires additional synchronization.
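
The loop above leaves verify() undefined. One possible shape for it is
sketched below, under assumptions that are mine rather than the patch's:
'head' and 'p' are unwrapped 64-bit byte offsets, 'head' only decreases in
a backward ring buffer, and the helper takes the buffer size as an extra
fourth argument (the loop above passes only three).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * The most recent buf_size bytes written are [head, head + buf_size).
 * A region of len bytes starting at p is still intact only if it lies
 * entirely inside that window. Unsigned subtraction keeps the check
 * correct even though head starts near zero and counts downwards.
 */
static bool verify(uint64_t len, uint64_t p, uint64_t head, uint64_t buf_size)
{
	return (p - head) + len <= buf_size;
}
```

Because 'head' is not wrapped, a burst that consumes exactly one full
buffer is still detected: 'head' then differs by buf_size even though it
maps to the same position inside the buffer.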

Alexei Starovoitov

Mar 29, 2016, 1:00:06 AM

Such a trick may work, but pause is needed for more than stability
of reading. When we collect the events into an overwrite buffer
we're waiting for some other trigger (like an all-cpu utilization
spike, or just one cpu running while all others are idle) and when
it happens the buffer has valuable info from the past. At this
point new events are no longer interesting and the buffer should
be paused, events read, and the buffer unpaused until the next
trigger comes.

Wangnan (F)

Mar 29, 2016, 2:10:06 AM

Agreed. I just want to provide an alternative method.
I'm trying to point out in the man page and commit
messages that pausing is not mandatory but highly
recommended.

Thank you.

Peter Zijlstra

Mar 29, 2016, 10:10:06 AM

On Mon, Mar 28, 2016 at 06:41:32AM +0000, Wang Nan wrote:

Could you maybe write a perf/tests thingy for this so that _some_
userspace exists that exercises this new code?

> int perf_output_begin(struct perf_output_handle *handle,
> struct perf_event *event, unsigned int size)
> {
> + if (unlikely(is_write_backward(event)))
> + return __perf_output_begin(handle, event, size, true);
> return __perf_output_begin(handle, event, size, false);
> }

Would something like:

int perf_output_begin(...)
{
        if (unlikely(is_write_backward(event)))
                return perf_output_begin_backward(...);
        return perf_output_begin_forward(...);
}

make sense; I'm not sure how much is still using this, but it seems
somewhat excessive to inline two copies of that thing into a single
function.

Alternatively; something like:

int perf_output_begin(...)
{
        return __perf_output_begin(..., unlikely(event->attr.backwards));
}

might make sense too.

Wangnan (F)

Mar 29, 2016, 10:30:07 PM

On 2016/3/29 22:04, Peter Zijlstra wrote:
> On Mon, Mar 28, 2016 at 06:41:32AM +0000, Wang Nan wrote:
>
> Could you maybe write a perf/tests thingy for this so that _some_
> userspace exists that exercises this new code?
>
>
>> int perf_output_begin(struct perf_output_handle *handle,
>> struct perf_event *event, unsigned int size)
>> {
>> + if (unlikely(is_write_backward(event)))
>> + return __perf_output_begin(handle, event, size, true);
>> return __perf_output_begin(handle, event, size, false);
>> }
> Would something like:
>
> int perf_output_begin(...)
> {
> if (unlikely(is_write_backward(event))
> return perf_output_begin_backward(...);
> return perf_output_begin_forward(...);
> }
>
> make sense; I'm not sure how much is still using this, but it seems
> somewhat excessive to inline two copies of that thing into a single
> function.

perf_output_begin is useful:

$ grep perf_output_begin ./kernel -r
./kernel/events/ring_buffer.c: * See perf_output_begin().
./kernel/events/ring_buffer.c:int perf_output_begin(struct
perf_output_handle *handle,
./kernel/events/ring_buffer.c: * perf_output_begin() only checks
rb->paused, therefore
./kernel/events/core.c: if (perf_output_begin(&handle, event,
header.size))
./kernel/events/core.c: ret = perf_output_begin(&handle, event,
read_event.header.size);
./kernel/events/core.c: ret = perf_output_begin(&handle, event,
./kernel/events/core.c: ret = perf_output_begin(&handle, event,
./kernel/events/core.c: ret = perf_output_begin(&handle, event,
./kernel/events/core.c: ret = perf_output_begin(&handle, event,
rec.header.size);
./kernel/events/core.c: ret = perf_output_begin(&handle, event,
./kernel/events/core.c: ret = perf_output_begin(&handle, event,
se->event_id.header.size);
./kernel/events/core.c: ret = perf_output_begin(&handle, event,
./kernel/events/core.c: ret = perf_output_begin(&handle, event,
rec.header.size);

Events like PERF_RECORD_MMAP2 use this function, so we still need to
consider its overhead.

So I will use your first suggestion.

Thank you.

Wangnan (F)

Mar 29, 2016, 10:40:07 PM

Sorry. Your second suggestion also seems good:

My implementation makes a big perf_output_begin(), but introduces only
one load and one branch.

Your first suggestion introduces one load, one branch and one function call.

Your second suggestion introduces one load, and at least one (and at
most three) branches.

I need some benchmarking results.

Thank you.

Wangnan (F)

Apr 5, 2016, 10:10:06 AM

On 2016/3/30 10:38, Wangnan (F) wrote:
>
>
> On 2016/3/30 10:28, Wangnan (F) wrote:
>>
>>
>> On 2016/3/29 22:04, Peter Zijlstra wrote:
>>> On Mon, Mar 28, 2016 at 06:41:32AM +0000, Wang Nan wrote:
>>>
>>> Could you maybe write a perf/tests thingy for this so that _some_
>>> userspace exists that exercises this new code?
>>>
>>>
>>>> int perf_output_begin(struct perf_output_handle *handle,
>>>> struct perf_event *event, unsigned int size)
>>>> {
>>>> + if (unlikely(is_write_backward(event)))
>>>> + return __perf_output_begin(handle, event, size, true);
>>>> return __perf_output_begin(handle, event, size, false);
>>>> }
>>> Would something like:
>>>
>>> int perf_output_begin(...)
>>> {
>>> if (unlikely(is_write_backward(event))
>>> return perf_output_begin_backward(...);
>>> return perf_output_begin_forward(...);
>>> }
>>>
>>> make sense; I'm not sure how much is still using this, but it seems
>>> somewhat excessive to inline two copies of that thing into a single
>>> function.
>>
>>

[SNIP]

>
> Sorry. Your second suggestion seems also good:
>
> My implementation makes a big perf_output_begin(), but introduces only
> one load and one branch.
>
> Your first suggestion introduces one load, one branch and one function
> call.
>
> Your second suggestion introduces one load, and at least one (and at
> most three) branches.
>
> I need some benchmarking result.
>
> Thank you.

No obvious performance difference among the three implementations.

Here are some numbers:

I tested the cost of generating a PERF_RECORD_COMM event using prctl with
the following code:

...
gettimeofday(&tv1, NULL);
for (i = 0; i < 1000 * 1000 * 3; i++) {
        char proc_name[10];

        snprintf(proc_name, sizeof(proc_name), "p:%d\n", i);
        prctl(PR_SET_NAME, proc_name);
}
gettimeofday(&tv2, NULL);
us1 = tv1.tv_sec * 1000000 + tv1.tv_usec;
us2 = tv2.tv_sec * 1000000 + tv2.tv_usec;
printf("%ld\n", us2 - us1);
...

Run this benchmark 100 times in each experiment. Bind the benchmark to
core 2 and perf to core 1 to ensure they are on the same CPU.
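
For anyone who wants to repeat the measurement, the excerpt can be
completed into a standalone helper along the following lines. This is a
reconstruction rather than the program that was actually run:
bench_prctl() and its 'iterations' parameter are names chosen here, and
the name buffer is sized to the 16-byte limit of PR_SET_NAME instead of
the 10 bytes used in the excerpt.

```c
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/time.h>

/*
 * Rename the calling thread 'iterations' times, which makes the kernel
 * emit one PERF_RECORD_COMM per call when comm events are being
 * recorded. Returns the elapsed wall-clock time in microseconds.
 */
static long bench_prctl(long iterations)
{
	struct timeval tv1, tv2;
	char proc_name[16];

	gettimeofday(&tv1, NULL);
	for (long i = 0; i < iterations; i++) {
		snprintf(proc_name, sizeof(proc_name), "p:%ld", i);
		prctl(PR_SET_NAME, proc_name);
	}
	gettimeofday(&tv2, NULL);

	return (tv2.tv_sec - tv1.tv_sec) * 1000000L +
	       (tv2.tv_usec - tv1.tv_usec);
}
```

Calling bench_prctl(3000000) corresponds to the loop count used above.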

Result:

BASE : execute without perf
4.5 : pure v4.5
TIP : with only patch 1-3/4 in this patch set applied
BIGFUNC : the implementation in my original patch
FUNCCALL: the implementation in Peter's first suggestion:

int perf_output_begin(...)
{
        if (unlikely(is_write_backward(event)))
                return perf_output_begin_backward(...);
        return perf_output_begin_forward(...);
}

BRANCH : the implementation in Peter's second suggestion:

int perf_output_begin(...)
{
        return __perf_output_begin(..., unlikely(event->attr.backwards));
}

'perf' is executed using:
# perf record -o /dev/null --no-buildid-cache -e syscalls:sys_enter_read ...

Results:

                MEAN     STDVAR
BASE    : 1122968.85   33492.52
4.5     : 2714200.70   26231.69
TIP     : 2646260.46   32610.56
BIGFUNC : 2661308.46   52707.47
FUNCCALL: 2636061.10   52607.80
BRANCH  : 2651335.74   34910.04

Considering the stdvar, the performance result is nearly identical.

I'd like to choose 'BRANCH' because its code looks better.

Thank you.

Wang Nan

Apr 5, 2016, 10:20:07 AM

This patch introduces a 'write_backward' bit to perf_event_attr, which
controls the direction of a ring buffer. When it is set, the corresponding
ring buffer is written from end to beginning. This feature is designed to
support reading from an overwritable ring buffer.

A ring buffer can be created by mapping a perf event fd. The kernel puts
event records into the ring buffer, and user programs like perf fetch them
from the address returned by mmap(). To prevent racing between the kernel
and perf, they communicate with each other through 'head' and 'tail'
pointers. The kernel maintains the 'head' pointer and points it to the
next free area (the tail of the last record). Perf maintains the 'tail'
pointer and points it to the tail of the last consumed record (a record
that has already been fetched). The kernel determines the available space
in a ring buffer using these two pointers to avoid overwriting unfetched
records.

By mapping without 'PROT_WRITE', an overwritable ring buffer is created.
Unlike a normal ring buffer, perf is unable to maintain the 'tail'
pointer because writing is forbidden. Therefore, for this type of ring
buffer, the kernel overwrites old records unconditionally and works like a
flight recorder. This feature would be useful if reading from an
overwritable ring buffer were as easy as reading from a normal ring
buffer. However, there's an obscure problem.

The following figure demonstrates a full overwritable ring buffer. In
this figure, the 'head' pointer points to the end of the last record, and
a long record 'E' is pending. For a normal ring buffer, the 'tail' pointer
would have pointed to position (X), so the kernel would know there's no
more space in the ring buffer. However, for an overwritable ring buffer,
the kernel ignores the 'tail' pointer.

(X)                              head
 .                                |
 .                                V
 +------+-------+----------+------+---+
 |A....A|B.....B|C........C|D....D|   |
 +------+-------+----------+------+---+

Record 'A' is overwritten by event 'E':

   head
    |
    V
 +--+---+-------+----------+------+---+
 |.E|..A|B.....B|C........C|D....D|E..|
 +--+---+-------+----------+------+---+

Now perf decides to read from this ring buffer. However, neither of these
two natural positions, 'head' and the start of this ring buffer, points
to the head of a record. Even though the full ring buffer can be
accessed by perf, it is unable to find a position to start decoding.

The first attempt to solve this problem, AFAIK, can be found in
[1]. It makes the kernel maintain the 'tail' pointer and update it when
the ring buffer is half full. However, this approach introduces overhead
to the fast path. Test results show a 1% overhead [2]. In addition, this
method utilizes no more than 50% of the records.

Another attempt can be found in [3], which allows putting the size of
an event at the end of each record. This approach allows perf to find
records in a backward manner from the 'head' pointer by reading the size
of a record from its tail. However, because of the alignment requirement,
it needs 8 bytes to record the size of a record, which is a huge waste.
Its performance is also not good, because more data needs to be written.
This approach also introduces some extra branch instructions to the fast
path.

'write_backward' is a better solution to this problem.

The following figure demonstrates the state of the overwritable ring
buffer when 'write_backward' is set, before overwriting:

    head
     |
     V
 +---+------+----------+-------+------+
 |   |D....D|C........C|B.....B|A....A|
 +---+------+----------+-------+------+

and after overwriting:
                                  head
                                   |
                                   V
 +---+------+----------+-------+---+--+
 |..E|D....D|C........C|B.....B|A..|E.|
 +---+------+----------+-------+---+--+

In each situation, 'head' points to the beginning of the newest record.
From this record, perf can iterate over the full ring buffer and fetch
records one by one.

The only limitation that needs to be considered is back-to-back reading.
Because user programs are non-deterministic, it is impossible to ensure
that the ring buffer stays stable during reading. Consider an extreme
situation: perf is scheduled out after reading record 'D', then a burst
of events comes and eats up the whole ring buffer (one or multiple
rounds). When perf comes back, reading after 'D' is now incorrect.

To prevent this problem, we need to find a way to ensure the ring buffer
is stable during reading. ioctl(PERF_EVENT_IOC_PAUSE_OUTPUT) is
suggested because its overhead is lower than
ioctl(PERF_EVENT_IOC_ENABLE).

By carefully verifying the 'header' pointer, the reader can avoid pausing
the ring buffer. For example:

/* A union of all possible events */
union perf_event event;

p = head = perf_mmap__read_head();
while (true) {
        /* copy header of next event */
        fetch(&event.header, p, sizeof(event.header));

        /* read 'head' pointer */
        head = perf_mmap__read_head();

        /* check overwritten: is the header good? */
        if (!verify(sizeof(event.header), p, head))
                break;

        /* copy the whole event */
        fetch(&event, p, event.header.size);

        /* read 'head' pointer again */
        head = perf_mmap__read_head();

        /* is the whole event good? */
        if (!verify(event.header.size, p, head))
                break;
        p += event.header.size;
}

However, the overhead is high because:

a) In-place decoding is not safe.
Copying-verifying-decoding is required.

b) Fetching 'head' pointer requires additional synchronization.

(From Alexei Starovoitov:

Even if this trick works, pause is needed for more than stability of
reading. When we collect the events into an overwrite buffer we're waiting
for some other trigger (like an all-cpu utilization spike, or just one cpu
running while all others are idle) and when it happens the buffer has
valuable info from the past. At this point new events are no longer
interesting and the buffer should be paused, events read, and the buffer
unpaused until the next trigger comes.)

Acked-by: Alexei Starovoitov <a...@kernel.org>
Cc: He Kuang <hek...@huawei.com>
Cc: Arnaldo Carvalho de Melo <ac...@redhat.com>
Cc: Brendan Gregg <brendan...@gmail.com>
Cc: Jiri Olsa <jo...@kernel.org>
Cc: Masami Hiramatsu <masami.hi...@hitachi.com>
Cc: Namhyung Kim <namh...@kernel.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Zefan Li <liz...@huawei.com>
Cc: pi3o...@163.com
---
include/linux/perf_event.h | 28 +++++++++++++++++++++---
include/uapi/linux/perf_event.h | 3 ++-
kernel/events/core.c | 48 ++++++++++++++++++++++++++++++++++++-----
kernel/events/ring_buffer.c | 16 +++++++++++++-
4 files changed, 85 insertions(+), 10 deletions(-)

context_switch : 1, /* context switch data */
- __reserved_1 : 37;
+ write_backward : 1, /* Write ring buffer from end to beginning */
+ __reserved_1 : 36;

union {
__u32 wakeup_events; /* wakeup every n events */

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 8c3b35f..263a9d8 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5693,9 +5693,13 @@ void perf_prepare_sample(struct perf_event_header *header,
 	}
 }

-void perf_event_output(struct perf_event *event,
-			struct perf_sample_data *data,
-			struct pt_regs *regs)
+static void __always_inline
+__perf_event_output(struct perf_event *event,
+		    struct perf_sample_data *data,
+		    struct pt_regs *regs,
+		    int (*output_begin)(struct perf_output_handle *,
+					struct perf_event *,
+					unsigned int))
 {
 	struct perf_output_handle handle;
 	struct perf_event_header header;
@@ -5705,7 +5709,7 @@ void perf_event_output(struct perf_event *event,

 	perf_prepare_sample(&header, data, event, regs);

-	if (perf_output_begin(&handle, event, header.size))
+	if (output_begin(&handle, event, header.size))
 		goto exit;

 	perf_output_sample(&handle, &header, data, event);
@@ -5716,6 +5720,30 @@ exit:

@@ -8152,8 +8180,11 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,

 	if (overflow_handler) {
 		event->overflow_handler = overflow_handler;
 		event->overflow_handler_context = context;
+	} else if (is_write_backward(event)){
+		event->overflow_handler = perf_event_output_backward;
+		event->overflow_handler_context = NULL;
 	} else {
-		event->overflow_handler = perf_event_output;
+		event->overflow_handler = perf_event_output_forward;
 		event->overflow_handler_context = NULL;
 	}

@@ -8388,6 +8419,13 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
 		goto out;

 	/*
+	 * Either writing ring buffer from beginning or from end.
+	 * Mixing is not allowed.
+	 */
+	if (is_write_backward(output_event) != is_write_backward(event))
+		goto out;
+
+	/*
 	 * If both events generate aux data, they must be on the same PMU
 	 */
 	if (has_aux(event) && has_aux(output_event) &&
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 60be55a..c49bab4 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -230,10 +230,24 @@ out:
 	return -ENOSPC;
 }

+int perf_output_begin_forward(struct perf_output_handle *handle,
+			     struct perf_event *event, unsigned int size)
+{
+	return __perf_output_begin(handle, event, size, false);
+}
+
+int perf_output_begin_backward(struct perf_output_handle *handle,
+			       struct perf_event *event, unsigned int size)
+{
+	return __perf_output_begin(handle, event, size, true);
+}
+
 int perf_output_begin(struct perf_output_handle *handle,
 		      struct perf_event *event, unsigned int size)
 {
-	return __perf_output_begin(handle, event, size, false);
+
+	return __perf_output_begin(handle, event, size,
+				   unlikely(is_write_backward(event)));
 }

 unsigned int perf_output_copy(struct perf_output_handle *handle,

--
1.8.3.4

Wangnan (F)

Apr 7, 2016, 5:50:07 AM

On 2016/3/29 22:04, Peter Zijlstra wrote:
> On Mon, Mar 28, 2016 at 06:41:32AM +0000, Wang Nan wrote:
>
> Could you maybe write a perf/tests thingy for this so that _some_
> userspace exists that exercises this new code?

Yes. Please see:

http://lkml.kernel.org/r/1460022180-61262-1-gi...@huawei.com

Thank you.

Wangnan (F)

Apr 12, 2016, 10:40:06 PM

Hi Peter,

When would you consider collecting this patch? You have
collected all of its dependencies; this is the last patch needed to
enable the backward attribute. I have already posted a 'perf test'
testcase for it, so there will be some usable user space code.

I will start posting my perf side code again once perf/core
supports the backward attribute. There are about 34 patches
now.

Thank you.

tip-bot for Wang Nan

Apr 23, 2016, 9:00:14 AM

Commit-ID: 9ecda41acb971ebd07c8fb35faf24005c0baea12
Gitweb: http://git.kernel.org/tip/9ecda41acb971ebd07c8fb35faf24005c0baea12
Author: Wang Nan <wang...@huawei.com>
AuthorDate: Tue, 5 Apr 2016 14:11:18 +0000
Committer: Ingo Molnar <mi...@kernel.org>
CommitDate: Sat, 23 Apr 2016 14:12:39 +0200

perf/core: Add ::write_backward attribute to perf event

This patch introduces the 'write_backward' bit to perf_event_attr, which
controls the direction of a ring buffer. Once it is set, the corresponding
ring buffer is written from end to beginning. This feature is designed to
support reading from an overwritable ring buffer.
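
As a minimal sketch of how user space might request this behaviour, the
new bit can be set like any other perf_event_attr flag before calling
perf_event_open(2), and the fd can then be mapped read-only. The event
type, the sampling period and the three helper names below are
illustrative assumptions, not part of the patch:

```c
#include <linux/perf_event.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: describe a software event with a backward ring buffer. */
static void init_backward_attr(struct perf_event_attr *attr)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_SOFTWARE;	/* assumed event, for illustration */
	attr->config = PERF_COUNT_SW_CONTEXT_SWITCHES;
	attr->sample_period = 1;
	attr->write_backward = 1;		/* the bit this patch adds */
}

/* Hypothetical helper: open the event for the calling thread on any CPU. */
static int open_backward_event(struct perf_event_attr *attr)
{
	/* perf_event_open() has no libc wrapper, so go through syscall(2). */
	return (int)syscall(SYS_perf_event_open, attr, 0, -1, -1, 0UL);
}

/* Hypothetical helper: map without PROT_WRITE to get an overwritable buffer. */
static void *map_overwritable(int fd, size_t data_pages)
{
	/* One metadata page plus 'data_pages' (a power of two) data pages. */
	size_t len = (1 + data_pages) * (size_t)sysconf(_SC_PAGESIZE);

	return mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
}
```

Opening the event needs a kernel that knows the bit and sufficient perf
permissions, so callers should expect open_backward_event() to return -1
and map_overwritable() to return MAP_FAILED on some systems.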

A ring buffer is created by mapping a perf event fd. The kernel puts
event records into the ring buffer, and user tooling such as perf
fetches them from the address returned by mmap(). To avoid races between
the kernel and tooling, the two communicate through 'head' and 'tail'
pointers. The kernel maintains the 'head' pointer, which points to the
next free area (the tail of the last record). Tooling maintains the
'tail' pointer, which points to the tail of the last consumed record
(a record that has already been fetched). The kernel determines the
available space in a ring buffer from these two pointers, so it avoids
overwriting unfetched records.
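
The head/tail handshake can be illustrated with a small user-space
model. This is only a sketch under simplifying assumptions: 'struct
ring', ring_reserve() and ring_consume() are hypothetical names, both
counters are free-running byte offsets, and no memory barriers or real
record data are modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a normal (forward) ring buffer. */
struct ring {
	uint64_t head;	/* advanced by the producer (the kernel) */
	uint64_t tail;	/* advanced by the consumer (tooling) */
	uint64_t size;	/* capacity in bytes */
};

/* Producer: reserve 'len' bytes only if no unfetched data would be lost. */
static bool ring_reserve(struct ring *rb, uint64_t len)
{
	uint64_t used = rb->head - rb->tail;

	if (len > rb->size - used)
		return false;	/* buffer full: the record is dropped */
	rb->head += len;
	return true;
}

/* Consumer: report that 'len' bytes have been fetched. */
static void ring_consume(struct ring *rb, uint64_t len)
{
	rb->tail += len;
}
```

With a 32-byte buffer that already holds records of 6, 7, 10 and 6
bytes, reserving a 5-byte record fails; it succeeds once the consumer
has fetched the oldest 6-byte record and moved 'tail' forward.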

By mapping without 'PROT_WRITE', an overwritable ring buffer is created.
Unlike with a normal ring buffer, tooling is unable to maintain the
'tail' pointer because writing is forbidden. Therefore, for this type of
ring buffer, the kernel overwrites old records unconditionally, working
like a flight recorder. This feature would be useful if reading from an
overwritable ring buffer were as easy as reading from a normal ring
buffer. However, there's an obscure problem.

The following figure demonstrates a full overwritable ring buffer. In
this figure, the 'head' pointer points to the end of the last record,
and a long record 'E' is pending. For a normal ring buffer, a 'tail'
pointer would have pointed to position (X), so the kernel would know
there's no more space in the ring buffer. However, for an overwritable
ring buffer, the kernel ignores the 'tail' pointer.

   (X)                              head
    .                                |
    .                                V
    +------+-------+----------+------+---+
    |A....A|B.....B|C........C|D....D|   |
    +------+-------+----------+------+---+

Record 'A' is overwritten by event 'E':

      head
       |
       V
    +--+---+-------+----------+------+---+
    |.E|..A|B.....B|C........C|D....D|E..|
    +--+---+-------+----------+------+---+

Now tooling decides to read from this ring buffer. However, neither of
the two natural positions, 'head' and the start of the ring buffer,
points to the head of a record. Even though tooling can access the full
ring buffer, it is unable to find a position to start decoding.

The first attempt to solve this problem that I know of can be found in
[1]. It makes the kernel maintain the 'tail' pointer, updating it when
the ring buffer is half full. However, this approach introduces overhead
to the fast path. Test results show a 1% overhead [2]. In addition, this
method can utilize no more than 50% of the records.

Another attempt can be found in [3], which allows putting the size of
an event at the end of each record. This approach allows tooling to find
records in a backward manner from the 'head' pointer by reading the size
of a record from its tail. However, because of alignment requirements,
it needs 8 bytes to record the size of a record, which is a huge waste.
Its performance is also not good, because more data needs to be written.
This approach also introduces some extra branch instructions to the fast
path.

'write_backward' is a better solution to this problem.

The following figure demonstrates the state of the overwritable ring
buffer when 'write_backward' is set, before overwriting:

       head
        |
        V
    +---+------+----------+-------+------+
    |   |D....D|C........C|B.....B|A....A|
    +---+------+----------+-------+------+

and after overwriting:

                                     head
                                      |
                                      V
    +---+------+----------+-------+---+--+
    |..E|D....D|C........C|B.....B|A..|E.|
    +---+------+----------+-------+---+--+

In each situation, 'head' points to the beginning of the newest record.
From this record, tooling can iterate over the full ring buffer and fetch
records one by one.
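
The two figures above can be replayed with a small user-space model.
This is a sketch only: the 32-byte buffer, 'struct rec_hdr' (a stand-in
for the real struct perf_event_header) and all function names are
hypothetical, and the buffer is assumed to be quiescent while it is
read.

```c
#include <stddef.h>
#include <stdint.h>

#define RB_SIZE 32u		/* hypothetical capacity, a power of two */
#define RB_MASK (RB_SIZE - 1)

/* Hypothetical record header. */
struct rec_hdr {
	uint8_t size;	/* total record length in bytes, header included */
	char tag;	/* label used in the figures: 'A', 'B', ... */
};

struct back_ring {
	uint8_t data[RB_SIZE];
	uint64_t head;	/* free-running; decreases as records are written */
};

/* Producer: move 'head' down by the record size, then store the header. */
static void write_backward(struct back_ring *rb, char tag, uint8_t size)
{
	struct rec_hdr h = { .size = size, .tag = tag };
	const uint8_t *src = (const uint8_t *)&h;

	rb->head -= size;
	for (size_t i = 0; i < sizeof(h); i++)
		rb->data[(rb->head + i) & RB_MASK] = src[i];
}

/* Consumer: start at 'head' (newest record) and step over each record. */
static size_t read_from_head(const struct back_ring *rb, char *tags, size_t max)
{
	uint64_t done = 0;	/* bytes walked so far, starting at 'head' */
	size_t n = 0;

	while (n < max && done + sizeof(struct rec_hdr) <= RB_SIZE) {
		struct rec_hdr h;
		uint8_t *dst = (uint8_t *)&h;

		for (size_t i = 0; i < sizeof(h); i++)
			dst[i] = rb->data[(rb->head + done + i) & RB_MASK];
		if (h.size < sizeof(h) || done + h.size > RB_SIZE)
			break;	/* unused space, or a record partly overwritten */
		tags[n++] = h.tag;
		done += h.size;
	}
	return n;
}
```

Writing records A (6 bytes), B (7), C (10) and D (6) and then reading
yields D, C, B, A. After a 5-byte record E is written, the walk yields
E, D, C, B and stops at A, whose tail has been overwritten.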

The only limitation that needs to be considered is back-to-back reading.
Because user programs are scheduled non-deterministically, it is
impossible to ensure that the ring buffer stays stable during reading.
Consider an extreme situation: tooling is scheduled out after reading
record 'D', then a burst of events arrives and eats up the whole ring
buffer (one or multiple rounds). When the tooling process comes back,
reading after 'D' is incorrect.

Even when this trick works, pause is needed for more than stability of
reading. When we collect the events into overwrite buffer we're waiting
for some other trigger (like all cpu utilization spike or just one cpu
running and all others are idle) and when it happens the buffer has
valuable info from the past. At this point new events are no longer
interesting and buffer should be paused, events read and unpaused until
next trigger comes.)
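
The pause, read, unpause cycle described above could look roughly like
the following. The text does not name the mechanism; this sketch assumes
the PERF_EVENT_IOC_PAUSE_OUTPUT ioctl defined in recent
<linux/perf_event.h> headers, and snapshot_ring() and drain_fn are
hypothetical names.

```c
#include <linux/perf_event.h>
#include <stddef.h>
#include <sys/ioctl.h>

/* Hypothetical callback type: reads the (now stable) ring buffer. */
typedef void (*drain_fn)(void *ctx);

/* Pause output, let the caller read the buffer, then resume output. */
static int snapshot_ring(int fd, drain_fn drain, void *ctx)
{
	if (ioctl(fd, PERF_EVENT_IOC_PAUSE_OUTPUT, 1) < 0)
		return -1;	/* could not pause: do not read */
	if (drain)
		drain(ctx);
	return ioctl(fd, PERF_EVENT_IOC_PAUSE_OUTPUT, 0) < 0 ? -1 : 0;
}
```

A trigger handler would call snapshot_ring() with the perf event fd and
a callback that walks the buffer from 'head'.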

This patch utilizes the event's default overflow_handler introduced
previously. perf_event_output_backward() is created as the default
overflow handler for backward ring buffers. To avoid extra overhead to
the fast path, the original perf_event_output() becomes
__perf_event_output() and is marked '__always_inline'. In theory, no
extra overhead is introduced to the fast path.
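
The pattern can be sketched outside the kernel as follows. All names
here are hypothetical stand-ins, and the GCC/Clang attribute spelling is
assumed in place of the kernel's '__always_inline' macro. Because the
shared body is always inlined and the function pointer is a constant at
each call site, the compiler is free to turn the indirect call into a
direct one in each wrapper.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct perf_output_handle. */
struct out_handle {
	bool backward;
	unsigned int size;
};

static int begin_forward(struct out_handle *h, unsigned int size)
{
	h->backward = false;
	h->size = size;
	return 0;
}

static int begin_backward(struct out_handle *h, unsigned int size)
{
	h->backward = true;
	h->size = size;
	return 0;
}

/* Shared body, specialized per wrapper through forced inlining. */
static inline __attribute__((always_inline)) int
emit_common(struct out_handle *h, unsigned int size,
	    int (*begin)(struct out_handle *, unsigned int))
{
	if (begin(h, size))
		return -1;
	/* ... the record would be written here ... */
	return 0;
}

static int emit_forward(struct out_handle *h, unsigned int size)
{
	return emit_common(h, size, begin_forward);
}

static int emit_backward(struct out_handle *h, unsigned int size)
{
	return emit_common(h, size, begin_backward);
}
```

The direction is then chosen once, when the handler is installed, rather
than tested on every event.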

Performance testing:

Call 'close(-1)' 3000000 times and use gettimeofday() to measure the
duration. Use 'perf record -o /dev/null -e raw_syscalls:*' to capture
system calls. Results are in ns.
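
A minimal sketch of such a timing loop is shown below. It is not the
author's actual test program (see [4] for that); time_close_calls() is a
hypothetical name, and it returns microseconds because that is the
resolution gettimeofday() provides.

```c
#include <stddef.h>
#include <sys/time.h>
#include <unistd.h>

/* Time 'iterations' calls of close(-1); returns elapsed microseconds. */
static long long time_close_calls(long iterations)
{
	struct timeval start, end;

	gettimeofday(&start, NULL);
	for (long i = 0; i < iterations; i++)
		close(-1);	/* fails with EBADF but still enters the kernel */
	gettimeofday(&end, NULL);

	return (end.tv_sec - start.tv_sec) * 1000000LL +
	       (end.tv_usec - start.tv_usec);
}
```

Running time_close_calls(3000000) once without tracing and once under
'perf record' gives one sample each for the untraced and traced cases.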

Testing environment:

  CPU    : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
  Kernel : v4.5.0

                 MEAN      STDVAR
  BASE     800214.950    2853.083
  PRE1    2253846.700    9997.014
  PRE2    2257495.540    8516.293
  POST    2250896.100    8933.921

Where 'BASE' is pure performance without capturing. 'PRE1' is the test
result of a pure 'v4.5.0' kernel. 'PRE2' is the test result before this
patch. 'POST' is the test result after this patch. See [4] for the
detailed experimental setup.

Considering the stdvar, this patch doesn't introduce performance
overhead to the fast path: the PRE2 and POST means differ by about 6600,
which is less than the STDVAR of either run.

[1] http://lkml.iu.edu/hypermail/linux/kernel/1304.1/04584.html
[2] http://lkml.iu.edu/hypermail/linux/kernel/1307.1/00535.html
[3] http://lkml.iu.edu/hypermail/linux/kernel/1512.0/01265.html
[4] http://lkml.kernel.org/g/56F89DCD...@huawei.com

Signed-off-by: Wang Nan <wang...@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Acked-by: Alexei Starovoitov <a...@kernel.org>
Cc: <ac...@kernel.org>
Cc: <pi3o...@163.com>
Cc: Alexander Shishkin <alexander...@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <ac...@redhat.com>
Cc: Brendan Gregg <brendan...@gmail.com>
Cc: He Kuang <hek...@huawei.com>
Cc: Jiri Olsa <jo...@kernel.org>
Cc: Jiri Olsa <jo...@redhat.com>
Cc: Linus Torvalds <torv...@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hi...@hitachi.com>
Cc: Namhyung Kim <namh...@kernel.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Stephane Eranian <era...@google.com>
Cc: Thomas Gleixner <tg...@linutronix.de>
Cc: Vince Weaver <vincent...@maine.edu>
Cc: Zefan Li <liz...@huawei.com>
Link: http://lkml.kernel.org/r/1459865478-53413-1-gi...@huawei.com
[ Fixed the changelog some more. ]
Signed-off-by: Ingo Molnar <mi...@kernel.org>

---
 include/linux/perf_event.h      | 28 +++++++++++++++++++++---
 include/uapi/linux/perf_event.h |  3 ++-
 kernel/events/core.c            | 48 ++++++++++++++++++++++++++++++++++++-----
 kernel/events/ring_buffer.c     | 16 +++++++++++++-
 4 files changed, 85 insertions(+), 10 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index b8b195f..85749ae 100644
@@ -1051,8 +1061,20 @@ static inline bool has_aux(struct perf_event *event)

index 21ba024..eabeb2a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5694,9 +5694,13 @@ void perf_prepare_sample(struct perf_event_header *header,
 	}
 }

-void perf_event_output(struct perf_event *event,
-			struct perf_sample_data *data,
-			struct pt_regs *regs)
+static void __always_inline
+__perf_event_output(struct perf_event *event,
+		    struct perf_sample_data *data,
+		    struct pt_regs *regs,
+		    int (*output_begin)(struct perf_output_handle *,
+					struct perf_event *,
+					unsigned int))
 {
 	struct perf_output_handle handle;
 	struct perf_event_header header;
@@ -5706,7 +5710,7 @@ void perf_event_output(struct perf_event *event,

 	perf_prepare_sample(&header, data, event, regs);

-	if (perf_output_begin(&handle, event, header.size))
+	if (output_begin(&handle, event, header.size))
 		goto exit;

 	perf_output_sample(&handle, &header, data, event);
@@ -5717,6 +5721,30 @@ exit:

@@ -8153,8 +8181,11 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,

 	if (overflow_handler) {
 		event->overflow_handler = overflow_handler;
 		event->overflow_handler_context = context;
+	} else if (is_write_backward(event)){
+		event->overflow_handler = perf_event_output_backward;
+		event->overflow_handler_context = NULL;
 	} else {
-		event->overflow_handler = perf_event_output;
+		event->overflow_handler = perf_event_output_forward;
 		event->overflow_handler_context = NULL;
 	}

@@ -8389,6 +8420,13 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)

Michael Kerrisk (man-pages)

Oct 21, 2016, 5:00:06 AM

Thanks for this patch, Wangnan.

Vince, do you have any comments?

Cheers,

Michael

Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/