use-after-free in __perf_install_in_context


Dmitry Vyukov

Dec 4, 2015, 3:04:55 PM
to Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
Hello,

While running syzkaller fuzzer I am seeing lots of the following
use-after-free reports. Unfortunately all my numerous attempts to
reproduce them in a controlled environment failed. They pop up during
fuzzing periodically (once in several hours in a single VM), but
whenever I try to stress-replay what happened in the VM before the
report, the use-after-free does not reproduce. Can somebody
knowledgeable in perf subsystem look at the report? Maybe it is
possible to figure out what happened based purely on the report. I can
pretty reliably test any proposed fixes.
All reports look like this one; each is usually followed by further
reports, and eventually the kernel hangs or dies. The fuzzer essentially
issues random syscalls with random arguments, with tasks being born and
dying concurrently, and so on. I was able to reproduce it by
restricting syscalls only to perf_event_open, perf ioctls and bpf
syscall.


==================================================================
BUG: KASAN: use-after-free in __lock_acquire+0x4e99/0x5100 at addr
ffff880038706e60
Read of size 8 by task syzkaller_execu/6513
=============================================================================
BUG kmalloc-1024 (Not tainted): kasan: bad access detected
-----------------------------------------------------------------------------

Disabling lock debugging due to kernel taint
INFO: Allocated in alloc_perf_context+0x4c/0x100 age=263 cpu=1 pid=6428
[< inline >] kzalloc include/linux/slab.h:602
[< none >] alloc_perf_context+0x4c/0x100 kernel/events/core.c:3399
[< none >] find_get_context+0x187/0x830 kernel/events/core.c:3506
[< none >] SYSC_perf_event_open+0xe50/0x21a0 kernel/events/core.c:8375
[< none >] SyS_perf_event_open+0x39/0x50 kernel/events/core.c:8236
[< none >] tracesys_phase2+0x88/0x8d arch/x86/entry/entry_64.S:269

INFO: Freed in free_ctx+0x4b/0x70 age=174 cpu=2 pid=8105
[< none >] kfree+0x26f/0x3e0 mm/slub.c:3632
[< none >] free_ctx+0x4b/0x70 kernel/events/core.c:872
[< inline >] __rcu_reclaim kernel/rcu/rcu.h:118
[< inline >] rcu_do_batch kernel/rcu/tree.c:2693
[< inline >] invoke_rcu_callbacks kernel/rcu/tree.c:2961
[< inline >] __rcu_process_callbacks kernel/rcu/tree.c:2928
[< none >] rcu_process_callbacks+0x631/0x19e0 kernel/rcu/tree.c:2945
[< none >] __do_softirq+0x2e5/0xb40 kernel/softirq.c:273
[< inline >] invoke_softirq kernel/softirq.c:350
[< none >] irq_exit+0x165/0x1e0 kernel/softirq.c:391
[< inline >] exiting_irq ./arch/x86/include/asm/apic.h:653
[< none >] smp_apic_timer_interrupt+0x88/0xc0
arch/x86/kernel/apic/apic.c:926
[< none >] apic_timer_interrupt+0x87/0x90
arch/x86/entry/entry_64.S:678

INFO: Slab 0xffffea0000e1c000 objects=24 used=16 fp=0xffff880038706e40
flags=0x1fffc0000004080
INFO: Object 0xffff880038706e40 @offset=28224 fp=0xffff8800387078c0
CPU: 1 PID: 6513 Comm: syzkaller_execu Tainted: G B 4.4.0-rc3+ #144
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
0000000000000001 ffff8800641ff680 ffffffff82c74978 0000000041b58ab3
ffffffff878cbafd ffffffff82c748c6 ffff88006459c380 ffffffff878ec293
ffff88003e806f80 0000000000000008 ffff880038706e40 ffff8800641ff680

Call Trace:
[<ffffffff81798654>] __asan_report_load8_noabort+0x54/0x70
mm/kasan/report.c:280
[<ffffffff814097e9>] __lock_acquire+0x4e99/0x5100 kernel/locking/lockdep.c:3092
[<ffffffff8140c36d>] lock_acquire+0x19d/0x3f0 kernel/locking/lockdep.c:3585
[< inline >] __raw_spin_lock include/linux/spinlock_api_smp.h:144
[<ffffffff8691aab1>] _raw_spin_lock+0x31/0x40 kernel/locking/spinlock.c:151
[< inline >] perf_ctx_lock kernel/events/core.c:351
[<ffffffff81638db9>] __perf_install_in_context+0x109/0xa00
kernel/events/core.c:2074
[<ffffffff816230da>] remote_function+0x14a/0x200 kernel/events/core.c:74
[<ffffffff814c9db7>] generic_exec_single+0x2a7/0x490 kernel/smp.c:156
[<ffffffff814ca980>] smp_call_function_single+0x200/0x310 kernel/smp.c:300
[<ffffffff816214f3>] task_function_call+0x123/0x160 kernel/events/core.c:101
[<ffffffff81629511>] perf_install_in_context+0x201/0x340
kernel/events/core.c:2155
[<ffffffff8164dac5>] SYSC_perf_event_open+0x1465/0x21a0
kernel/events/core.c:8540
[<ffffffff81656c29>] SyS_perf_event_open+0x39/0x50 kernel/events/core.c:8236
[<ffffffff8691b9f8>] tracesys_phase2+0x88/0x8d arch/x86/entry/entry_64.S:269
==================================================================

On commit 31ade3b83e1821da5fbb2f11b5b3d4ab2ec39db8.

Thank you

Alexei Starovoitov

Dec 4, 2015, 3:32:24 PM
to Dmitry Vyukov, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
On Fri, Dec 04, 2015 at 09:04:35PM +0100, Dmitry Vyukov wrote:
> Hello,
>
> While running syzkaller fuzzer I am seeing lots of the following
> use-after-free reports. Unfortunately all my numerous attempts to
> reproduce them in a controlled environment failed. They pop up during
> fuzzing periodically (once in several hours in a single VM), but
> whenever I try to stress-replay what happened in the VM before the
> report, the use-after-free does not reproduce. Can somebody
> knowledgeable in perf subsystem look at the report? Maybe it is
> possible to figure out what happened based purely on the report. I can
> pretty reliably test any proposed fixes.
> All reports look like this one; each is usually followed by further
> reports, and eventually the kernel hangs or dies. The fuzzer essentially
> issues random syscalls with random arguments, with tasks being born and
> dying concurrently, and so on. I was able to reproduce it by
> restricting syscalls only to perf_event_open, perf ioctls and bpf
> syscall.

For the sake of narrowing it down:
does the error disappear when you stop using the bpf syscall in your fuzzing?
If yes, then I could have missed some interaction between perf_event_free,
kprobe free and bpf_prog_free.
There was a race there before.
Maybe there is still something else.

Dmitry Vyukov

Dec 4, 2015, 4:00:41 PM
to Alexei Starovoitov, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
It is a good question. I will test it.

Dmitry Vyukov

Dec 7, 2015, 6:05:02 AM
to Alexei Starovoitov, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
Testing without bpf; so far I have hit the following warning on a non-tainted kernel:


------------[ cut here ]------------
WARNING: CPU: 0 PID: 26259 at kernel/events/core.c:2696
task_ctx_sched_out+0x12c/0x1c0()
Modules linked in:
CPU: 0 PID: 26259 Comm: syzkaller_execu Not tainted 4.4.0-rc3+ #150
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
0000000000000000 ffff88006579f9a8 ffffffff82c6f2a8 0000000041b58ab3
ffffffff8788bf8d ffffffff82c6f1f6 ffff88003dd32d00 ffff88003469b9c0
ffffffff81626130 ffff88006579fa10 0000000000000000 0000000000006693
Call Trace:
[< inline >] __dump_stack lib/dump_stack.c:15
[<ffffffff82c6f2a8>] dump_stack+0xb2/0xfa lib/dump_stack.c:50
[<ffffffff812ca446>] warn_slowpath_common+0xe6/0x170 kernel/panic.c:460
[<ffffffff812ca699>] warn_slowpath_null+0x29/0x30 kernel/panic.c:493
[<ffffffff8163302c>] task_ctx_sched_out+0x12c/0x1c0 kernel/events/core.c:2696
[< inline >] perf_event_exit_task_context kernel/events/core.c:8815
[<ffffffff8165669a>] perf_event_exit_task+0x27a/0xae0 kernel/events/core.c:8881
[<ffffffff812d4a12>] do_exit+0x892/0x3050 kernel/exit.c:759
[<ffffffff812d732c>] do_group_exit+0xec/0x390 kernel/exit.c:880
[<ffffffff81302177>] get_signal+0x677/0x1bf0 kernel/signal.c:2307
[<ffffffff8118645e>] do_signal+0x7e/0x2170 arch/x86/kernel/signal.c:709
[<ffffffff81003a1e>] exit_to_usermode_loop+0xfe/0x1e0
arch/x86/entry/common.c:247
[< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:282
[<ffffffff8100733b>] syscall_return_slowpath+0x16b/0x240
arch/x86/entry/common.c:344
[<ffffffff868dafe2>] int_ret_from_sys_call+0x25/0x9f
arch/x86/entry/entry_64.S:281
---[ end trace c5e9cee9f12b7d5f ]---



static void task_ctx_sched_out(struct perf_event_context *ctx)
{
	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);

	if (!cpuctx->task_ctx)
		return;

	/* This is the check at kernel/events/core.c:2696 that fired above:
	 * the ctx being scheduled out is not the one the cpuctx thinks is
	 * currently installed. */
	if (WARN_ON_ONCE(ctx != cpuctx->task_ctx))
		return;

	ctx_sched_out(ctx, cpuctx, EVENT_ALL);
	cpuctx->task_ctx = NULL;
}

Dmitry Vyukov

Dec 7, 2015, 6:06:46 AM
to Alexei Starovoitov, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
Also this one. I saw these regularly, but I thought they were induced
by the use-after-frees; now, however, I am seeing them on a non-tainted
kernel:


------------[ cut here ]------------
WARNING: CPU: 1 PID: 25357 at arch/x86/kernel/hw_breakpoint.c:160
arch_uninstall_hw_breakpoint+0x1e1/0x2a0()
Can't find any breakpoint slot
Modules linked in:
CPU: 1 PID: 25357 Comm: syzkaller_execu Not tainted 4.4.0-rc3+ #150
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
0000000000000001 ffff8800333c74e8 ffffffff82c6f2a8 0000000041b58ab3
ffffffff8788bf8d ffffffff82c6f1f6 ffffffff8696a5e0 ffff8800333c7560
ffffffff8696a620 00000000000000a0 0000000000000009 ffff8800333c74c8
Call Trace:
[< inline >] __dump_stack lib/dump_stack.c:15
[<ffffffff82c6f2a8>] dump_stack+0xb2/0xfa lib/dump_stack.c:50
[<ffffffff812ca446>] warn_slowpath_common+0xe6/0x170 kernel/panic.c:460
[<ffffffff812ca579>] warn_slowpath_fmt+0xa9/0xd0 kernel/panic.c:472
[<ffffffff811a1c31>] arch_uninstall_hw_breakpoint+0x1e1/0x2a0
arch/x86/kernel/hw_breakpoint.c:160
[<ffffffff8165e1b5>] hw_breakpoint_del+0x15/0x20
kernel/events/hw_breakpoint.c:595
[<ffffffff81631aa4>] event_sched_out+0x704/0x1270 kernel/events/core.c:1580
[<ffffffff81632681>] group_sched_out+0x71/0x1f0 kernel/events/core.c:1606
[<ffffffff81632d02>] ctx_sched_out+0x502/0x700 kernel/events/core.c:2414
[< inline >] perf_event_context_sched_out kernel/events/core.c:2593
[<ffffffff81645524>] __perf_event_task_sched_out+0x4d4/0x1800
kernel/events/core.c:2678
[< inline >] perf_event_task_sched_out include/linux/perf_event.h:946
[< inline >] prepare_task_switch kernel/sched/core.c:2489
[< inline >] context_switch kernel/sched/core.c:2647
[<ffffffff868c986f>] __schedule+0xa0f/0x1f40 kernel/sched/core.c:3156
[<ffffffff868cae0d>] schedule+0x6d/0x210 kernel/sched/core.c:3185
[< inline >] freezable_schedule include/linux/freezer.h:171
[<ffffffff814c09c6>] futex_wait_queue_me+0x306/0x680 kernel/futex.c:2188
[<ffffffff814c238c>] futex_wait+0x22c/0x6b0 kernel/futex.c:2303
[<ffffffff814c6c3d>] do_futex+0x20d/0x19d0 kernel/futex.c:3067
[< inline >] SYSC_futex kernel/futex.c:3127
[<ffffffff814c8533>] SyS_futex+0x133/0x2b0 kernel/futex.c:3095
[<ffffffff868dafb8>] tracesys_phase2+0x88/0x8d arch/x86/entry/entry_64.S:269
---[ end trace 624ab9515f02dd2b ]---

Dmitry Vyukov

Dec 7, 2015, 6:24:32 AM
to Alexei Starovoitov, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
I hit exactly the same use-after-free report without using bpf (just
perf_event_open and perf ioctls).

Peter Zijlstra

Dec 7, 2015, 10:36:36 AM
to Dmitry Vyukov, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
On Fri, Dec 04, 2015 at 09:04:35PM +0100, Dmitry Vyukov wrote:
So it would be _awesome_ if we could somehow extend this callchain to
include the site that calls call_rcu().

In any case, this looks like we got the refcounting wrong somewhere;
I'll go audit that stuff.

Dmitry Vyukov

Dec 7, 2015, 11:09:42 AM
to Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
We have a patch for KASAN in the works that adds a so-called stack depot,
which allows mapping a stack trace onto a uint32 id. Then we can attach
these ids not only to memory blocks, but also to rcu callbacks, thread
pool items, etc. But it's not ready yet.
If your audit does not give any results, can you give me a patch that
prints rcu callback submission stacks in KASAN reports?

Alexei Starovoitov

Dec 7, 2015, 10:24:09 PM
to Dmitry Vyukov, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
On Mon, Dec 07, 2015 at 05:09:21PM +0100, Dmitry Vyukov wrote:
> > So it would be _awesome_ if we could somehow extend this callchain to
> > include the site that calls call_rcu().
>
> We have a patch for KASAN in the works that adds a so-called stack depot,
> which allows mapping a stack trace onto a uint32 id. Then we can attach

I was hacking something similar to categorize stack traces with a u32 id.
How are you planning to limit the number of such stack traces?
And what is the interface for user space to get a stack trace from an id?

Dmitry Vyukov

Dec 8, 2015, 11:12:24 AM
to Alexei Starovoitov, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
We don't limit the number of stack traces. The kernel does not seem to
use data-driven recursion extensively, so there is a limited number of
distinct stacks. We will probably need to strip the non-interrupt part
of interrupt stacks, though; otherwise they can produce an unbounded
number of different stacks.
There is no user-space interface; it is used only inside the kernel to
save stacks for memory blocks (rcu callbacks, and thread pool items in
the future).
The design is based on what we have successfully and extensively used
in user-space sanitizers for years. The current code is here:
https://github.com/ramosian-glider/kasan/commit/fb0eefd212366401ed5ad244233ef379a27bfb46

Peter Zijlstra

Dec 8, 2015, 11:22:31 AM
to Dmitry Vyukov, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
On Fri, Dec 04, 2015 at 09:04:35PM +0100, Dmitry Vyukov wrote:
> Hello,
>
> While running syzkaller fuzzer I am seeing lots of the following
> use-after-free reports. Unfortunately all my numerous attempts to
> reproduce them in a controlled environment failed. They pop up during
> fuzzing periodically (once in several hours in a single VM), but
> whenever I try to stress-replay what happened in the VM before the
> report, the use-after-free does not reproduce.

Typical that :/

> Can somebody knowledgeable in perf subsystem look at the report? Maybe
> it is possible to figure out what happened based purely on the report.
> I can pretty reliably test any proposed fixes.

So I'm still going over the code, but meanwhile I tried reproducing this
using the perf_fuzzer and some debug code, with no luck so far.

Since you seem to be able to reproduce, could you do a run with the
patch below to see if it tickles something?


---
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 36babfd..58e504c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -872,13 +872,22 @@ static void free_ctx(struct rcu_head *head)
kfree(ctx);
}

+#define for_each_task_context_nr(ctxn) \
+ for ((ctxn) = 0; (ctxn) < perf_nr_task_contexts; (ctxn)++)
+
static void put_ctx(struct perf_event_context *ctx)
{
if (atomic_dec_and_test(&ctx->refcount)) {
if (ctx->parent_ctx)
put_ctx(ctx->parent_ctx);
- if (ctx->task)
+ if (ctx->task) {
+ int ctxn;
+
+ for_each_task_context_nr(ctxn)
+ WARN_ON(ctx->task->perf_event_ctxp[ctxn] == ctx);
+
put_task_struct(ctx->task);
+ }
call_rcu(&ctx->rcu_head, free_ctx);
}
}
@@ -2649,9 +2658,6 @@ static void perf_pmu_sched_task(struct task_struct *prev,
static void perf_event_switch(struct task_struct *task,
struct task_struct *next_prev, bool sched_in);

-#define for_each_task_context_nr(ctxn) \
- for ((ctxn) = 0; (ctxn) < perf_nr_task_contexts; (ctxn)++)
-
/*
* Called from scheduler to remove the events of the current task,
* with interrupts disabled.
@@ -8931,6 +8937,8 @@ again:

mutex_unlock(&ctx->mutex);

+ task->perf_event_ctxp[ctxn] = NULL;
+
put_ctx(ctx);
}
}

Peter Zijlstra

Dec 8, 2015, 11:27:08 AM
to Dmitry Vyukov, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
On Fri, Dec 04, 2015 at 09:04:35PM +0100, Dmitry Vyukov wrote:

> I was able to reproduce it by
> restricting syscalls only to perf_event_open, perf ioctls and bpf
> syscall.

Does that include cpu hotplug?

Sasha just reminded me there is a swevent use-after-free issue on
hotplug.

Peter Zijlstra

Dec 8, 2015, 11:44:14 AM
to Dmitry Vyukov, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
On Mon, Dec 07, 2015 at 05:09:21PM +0100, Dmitry Vyukov wrote:
> If your audit does not give any results, can you give me a patch that
> prints rcu callback submission stacks in KASAN reports?

Just because my brain is fried for today, I figured I'd give it a go.

Completely untested...

---
include/linux/slub_def.h | 2 ++
kernel/rcu/tree_plugin.h | 1 +
mm/slub.c | 34 ++++++++++++++++++++++++++++++++--
3 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 33885118523c..445b586c0bfa 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -129,4 +129,6 @@ static inline void *virt_to_obj(struct kmem_cache *s,
void object_err(struct kmem_cache *s, struct page *page,
u8 *object, char *reason);

+void object_set_indirect(const void *addr);
+
#endif /* _LINUX_SLUB_DEF_H */
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 630c19772630..4e1e79e01e34 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -659,6 +659,7 @@ static void rcu_preempt_do_callbacks(void)
*/
void call_rcu(struct rcu_head *head, rcu_callback_t func)
{
+ object_set_indirect(head);
__call_rcu(head, func, rcu_state_p, -1, 0);
}
EXPORT_SYMBOL_GPL(call_rcu);
diff --git a/mm/slub.c b/mm/slub.c
index 46997517406e..6977dc7cffcd 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -196,7 +196,12 @@ struct track {
unsigned long when; /* When did the operation occur */
};

-enum track_item { TRACK_ALLOC, TRACK_FREE };
+enum track_item {
+ TRACK_ALLOC = 0,
+ TRACK_FREE,
+ TRACK_INDIRECT,
+ TRACK_NR,
+};

#ifdef CONFIG_SYSFS
static int sysfs_slab_add(struct kmem_cache *);
@@ -551,6 +556,7 @@ static void init_tracking(struct kmem_cache *s, void *object)

set_track(s, object, TRACK_FREE, 0UL);
set_track(s, object, TRACK_ALLOC, 0UL);
+ set_track(s, object, TRACK_INDIRECT, 0UL);
}

static void print_track(const char *s, struct track *t)
@@ -579,6 +585,7 @@ static void print_tracking(struct kmem_cache *s, void *object)

print_track("Allocated", get_track(s, object, TRACK_ALLOC));
print_track("Freed", get_track(s, object, TRACK_FREE));
+ print_track("Indirect", get_track(s, object, TRACK_INDIRECT));
}

static void print_page_info(struct page *page)
@@ -652,6 +659,29 @@ static void print_trailer(struct kmem_cache *s, struct page *page, u8 *p)
dump_stack();
}

+void object_set_indirect(const void *addr)
+{
+ if ((addr >= (void *)PAGE_OFFSET) &&
+ (addr < high_memory)) {
+ struct page *page = virt_to_head_page(addr);
+
+ if (PageSlab(page)) {
+ void *object;
+ struct kmem_cache *cache = page->slab_cache;
+ void *last_object;
+
+ object = virt_to_obj(cache, page_address(page), addr);
+ last_object = page_address(page) +
+ page->objects * cache->size;
+
+ if (unlikely(object > last_object))
+ object = last_object; /* we hit into padding */
+
+ set_track(cache, object, TRACK_INDIRECT, (unsigned long)addr);
+ }
+ }
+}
+
void object_err(struct kmem_cache *s, struct page *page,
u8 *object, char *reason)
{
@@ -767,7 +797,7 @@ static int check_pad_bytes(struct kmem_cache *s, struct page *page, u8 *p)

if (s->flags & SLAB_STORE_USER)
/* We also have user information there */
- off += 2 * sizeof(struct track);
+ off += TRACK_NR * sizeof(struct track);

if (s->size == off)
return 1;

Dmitry Vyukov

Dec 8, 2015, 11:51:05 AM
to Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
On Tue, Dec 8, 2015 at 5:27 PM, Peter Zijlstra <pet...@infradead.org> wrote:
> On Fri, Dec 04, 2015 at 09:04:35PM +0100, Dmitry Vyukov wrote:
>
>> I was able to reproduce it by
>> restricting syscalls only to perf_event_open, perf ioctls and bpf
>> syscall.
>
> Does that include cpu hotplug?

I have CONFIG_HOTPLUG_CPU enabled. But I don't do actual plug/unplug.

Alexei Starovoitov

Dec 8, 2015, 12:54:41 PM
to Dmitry Vyukov, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
Why did you pick an approach that never frees the accumulated stacks?
That limits usability a lot, since once kasan starts using it, only a
reboot will free the memory. Ouch.
What worked for user space doesn't work for the kernel.

Dmitry Vyukov

Dec 8, 2015, 12:56:44 PM
to syzkaller, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
On Tue, Dec 8, 2015 at 6:54 PM, Alexei Starovoitov
Freeing and reusing would slow down and complicate the code significantly.
And it is not yet proven to be necessary.

Alexei Starovoitov

Dec 8, 2015, 1:05:35 PM
to Dmitry Vyukov, syzkaller, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
It's a joke, right? Allocating kernel pages without the ability to free them?!

Dmitry Vyukov

Dec 8, 2015, 1:35:41 PM
to Alexei Starovoitov, syzkaller, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
On Tue, Dec 8, 2015 at 7:05 PM, Alexei Starovoitov
The plan is to smash the kernel much earlier than it runs out of memory.


I think this scheme will work pretty well.
I've counted 34403 calls to
kmalloc/kfree/kmem_cache_alloc/kmem_cache_free in the kernel. Multiplying
this by 2 to account for different stacks leading to the same
malloc/free, and assuming that we store a 16-byte header and 20 4-byte
frames, this gives us about 6 MB. I can live with that. I can live with
60 MB as well.
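
(Spelling the estimate out: 2 × 34,403 ≈ 68,806 distinct stacks; a
16-byte header plus 20 × 4 = 80 bytes of frames is 96 bytes per stack;
68,806 × 96 bytes ≈ 6.6 MB.)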

Ingo Molnar

Dec 8, 2015, 1:57:41 PM
to Peter Zijlstra, Dmitry Vyukov, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin

* Peter Zijlstra <pet...@infradead.org> wrote:

> On Fri, Dec 04, 2015 at 09:04:35PM +0100, Dmitry Vyukov wrote:
> > Hello,
> >
> > While running syzkaller fuzzer I am seeing lots of the following
> > use-after-free reports. Unfortunately all my numerous attempts to
> > reproduce them in a controlled environment failed. They pop up during
> > fuzzing periodically (once in several hours in a single VM), but
> > whenever I try to stress-replay what happened in the VM before the
> > report, the use-after-free does not reproduce.
>
> Typical that :/
>
> > Can somebody knowledgeable in perf subsystem look at the report? Maybe
> > it is possible to figure out what happened based purely on the report.
> > I can pretty reliably test any proposed fixes.
>
> So I'm still going over the code, but meanwhile I tried reproducing this
> using the perf_fuzzer and some debug code, but no luck with that.
>
> Since you seem to be able to reproduce, could you do a run with the
> below patch in to see if it tickles something?

Btw., could we add more redundancy / debug code to the refcounting code? It seems
to be a frequent source of very hard to find/fix races/bugs - so it should be ripe
for some extra debug infrastructure ...

Thanks,


Ingo
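
[The kind of extra checking Ingo is asking for might look like the sketch
below. This is a hypothetical illustration, not code from the thread; the
kernel later grew a dedicated refcount_t type along these lines.]

/*
 * Minimal debug wrappers around reference count operations: warn loudly
 * on the two classic refcounting bugs instead of corrupting state
 * silently. All names here are illustrative, not an existing API.
 */
static inline bool debug_refcount_dec_and_test(atomic_t *count)
{
	int new = atomic_dec_return(count);

	/* Going negative means someone dropped a reference twice. */
	WARN_ONCE(new < 0, "refcount underflow (%d)\n", new);

	return new == 0;
}

static inline void debug_refcount_inc(atomic_t *count)
{
	/* Incrementing from zero resurrects an object already being freed. */
	WARN_ONCE(atomic_inc_return(count) == 1,
		  "refcount inc from zero (possible use-after-free)\n");
}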

Dmitry Vyukov

Dec 8, 2015, 2:15:18 PM
to Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
Peter,

Tested with your patches.
The additional WARNING does not fire.
For the rcu stacks, I had to change two more 2's to TRACK_NR and also
move the memorization from call_rcu to __call_rcu, but now it is working.
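
[A hypothetical reconstruction of that move; __call_rcu() is the common
path behind call_rcu(), call_rcu_sched() and call_rcu_bh(), so hooking it
records the submission site for every flavour. The remaining 2's were
presumably the track-array sizing sites in mm/slub.c, e.g.
calculate_sizes(), though the thread does not say.]

static void
__call_rcu(struct rcu_head *head, rcu_callback_t func,
	   struct rcu_state *rsp, int cpu, bool lazy)
{
	object_set_indirect(head);	/* moved here from call_rcu() */
	/* ... original __call_rcu() body unchanged ... */
}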
Two reports with indirect stack:


BUG: KASAN: use-after-free in __lock_acquire+0x4e99/0x5100 at addr
ffff88003096c640
BUG kmalloc-1024 (Not tainted): kasan: bad access detected
-----------------------------------------------------------------------------

irq event stamp: 136
hardirqs last enabled at (135): [<ffffffff814c12e4>]
__raw_spin_lock_init+0x24/0x120 kernel/locking/spinlock_debug.c:23
hardirqs last disabled at (136): [<ffffffff8183c967>]
__slab_alloc+0x37/0x90 mm/slub.c:2487
softirqs last enabled at (38): [<ffffffff8138417d>]
__do_softirq+0x5ed/0xb40 kernel/softirq.c:299
softirqs last disabled at (31): [< inline >] invoke_softirq
kernel/softirq.c:350
softirqs last disabled at (31): [<ffffffff813849d5>]
irq_exit+0x165/0x1e0 kernel/softirq.c:391

INFO: Allocated in alloc_perf_context+0x4c/0x100 age=220 cpu=1 pid=15649
[< none >] ___slab_alloc+0x648/0x8c0 mm/slub.c:2468
[< none >] __slab_alloc+0x4c/0x90 mm/slub.c:2497
[< inline >] slab_alloc_node mm/slub.c:2560
[< inline >] slab_alloc mm/slub.c:2602
[< none >] kmem_cache_alloc_trace+0x23c/0x3f0 mm/slub.c:2619
[< inline >] kmalloc include/linux/slab.h:458
[< inline >] kzalloc include/linux/slab.h:602
[< none >] alloc_perf_context+0x4c/0x100 kernel/events/core.c:3405
[< none >] find_get_context+0x187/0x830 kernel/events/core.c:3512
[< none >] SYSC_perf_event_open+0xe50/0x21a0 kernel/events/core.c:8381
[< none >] SyS_perf_event_open+0x39/0x50 kernel/events/core.c:8242
[< none >] tracesys_phase2+0x88/0x8d arch/x86/entry/entry_64.S:269

INFO: Freed in free_ctx+0x4b/0x70 age=89 cpu=3 pid=21
[< none >] __slab_free+0x21e/0x3e0 mm/slub.c:2678
[< inline >] slab_free mm/slub.c:2833
[< none >] kfree+0x26f/0x3e0 mm/slub.c:3662
[< none >] free_ctx+0x4b/0x70 kernel/events/core.c:872
[< inline >] __rcu_reclaim kernel/rcu/rcu.h:118
[< inline >] rcu_do_batch kernel/rcu/tree.c:2693
[< inline >] invoke_rcu_callbacks kernel/rcu/tree.c:2961
[< inline >] __rcu_process_callbacks kernel/rcu/tree.c:2928
[< none >] rcu_process_callbacks+0x631/0x19e0 kernel/rcu/tree.c:2945
[< none >] __do_softirq+0x2e5/0xb40 kernel/softirq.c:273
[< none >] run_ksoftirqd+0x20/0x50 kernel/softirq.c:662
[< none >] smpboot_thread_fn+0x41d/0x850 kernel/smpboot.c:163
[< none >] kthread+0x21d/0x2e0 drivers/block/aoe/aoecmd.c:1314
[< none >] ret_from_fork+0x3f/0x70 arch/x86/entry/entry_64.S:468

INFO: Indirect in __call_rcu.constprop.63+0x7e/0xe30 age=126 cpu=3 pid=15645
[< none >] __call_rcu.constprop.63+0x7e/0xe30 kernel/rcu/tree.c:3048
[< none >] call_rcu_sched+0x26/0x30 kernel/rcu/tree.c:3117
[< none >] put_ctx+0x150/0x250 kernel/events/core.c:891
[< inline >] __free_event kernel/events/core.c:3707
[< none >] _free_event+0x2c7/0xaa0 kernel/events/core.c:3738
[< none >] put_event+0x294/0x4d0 kernel/events/core.c:3834
[< none >] perf_release+0x3c/0x60 kernel/events/core.c:3849
[< none >] __fput+0x244/0x860 fs/file_table.c:208
[< none >] ____fput+0x15/0x20 fs/file_table.c:244
[< none >] task_work_run+0x130/0x240 kernel/task_work.c:115
[< inline >] exit_task_work include/linux/task_work.h:21
[< none >] do_exit+0x885/0x3050 kernel/exit.c:750
[< none >] do_group_exit+0xec/0x390 kernel/exit.c:880
[< none >] get_signal+0x677/0x1bf0 kernel/signal.c:2307
[< none >] do_signal+0x7e/0x2170 arch/x86/kernel/signal.c:709
[< none >] exit_to_usermode_loop+0xfe/0x1e0
arch/x86/entry/common.c:247
[< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:282
[< none >] syscall_return_slowpath+0x16b/0x240
arch/x86/entry/common.c:344
[< none >] int_ret_from_sys_call+0x25/0x9f
arch/x86/entry/entry_64.S:281

INFO: Slab 0xffffea0000c25a00 objects=21 used=1 fp=0xffff88003096d7a8
flags=0x1fffc0000004080
INFO: Object 0xffff88003096c620 @offset=17952 fp=0x (null)
CPU: 1 PID: 15687 Comm: syzkaller_execu Tainted: G B
4.4.0-rc3+ #155
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
0000000000000001 ffff88003794f680 ffffffff82e0f8b8 0000000041b58ab3
ffffffff87a9a27d ffffffff82e0f806 ffff880033c62e80 ffffffff87abb3d1
ffff88003e806d00 0000000000000008 ffff88003096c620 ffff88003794f680

Call Trace:
[< inline >] __dump_stack lib/dump_stack.c:15
[<ffffffff82e0f8b8>] dump_stack+0xb2/0xfa lib/dump_stack.c:50
[<ffffffff818369dc>] print_trailer+0x12c/0x210 mm/slub.c:659
[<ffffffff81841a3f>] object_err+0x2f/0x40 mm/slub.c:689
[< inline >] print_address_description mm/kasan/report.c:138
[<ffffffff818457a9>] kasan_report_error+0x5d9/0x860 mm/kasan/report.c:251
[< inline >] kasan_report mm/kasan/report.c:274
[<ffffffff81845bd4>] __asan_report_load8_noabort+0x54/0x70
mm/kasan/report.c:295
[<ffffffff814b25f9>] __lock_acquire+0x4e99/0x5100 kernel/locking/lockdep.c:3092
[<ffffffff814b517d>] lock_acquire+0x19d/0x3f0 kernel/locking/lockdep.c:3585
[< inline >] __raw_spin_lock include/linux/spinlock_api_smp.h:144
[<ffffffff86a89591>] _raw_spin_lock+0x31/0x40 kernel/locking/spinlock.c:151
[< inline >] perf_ctx_lock kernel/events/core.c:351
[<ffffffff816e15d9>] __perf_install_in_context+0x109/0xa00
kernel/events/core.c:2083
[<ffffffff816cb7ba>] remote_function+0x14a/0x200 kernel/events/core.c:74
[<ffffffff81572787>] generic_exec_single+0x2a7/0x490 kernel/smp.c:156
[<ffffffff81573350>] smp_call_function_single+0x200/0x310 kernel/smp.c:300
[<ffffffff816c9bd3>] task_function_call+0x123/0x160 kernel/events/core.c:101
[<ffffffff816d1bf1>] perf_install_in_context+0x201/0x340
kernel/events/core.c:2164
[<ffffffff816f62e5>] SYSC_perf_event_open+0x1465/0x21a0
kernel/events/core.c:8546
[<ffffffff816ff449>] SyS_perf_event_open+0x39/0x50 kernel/events/core.c:8242
[<ffffffff86a8a3b6>] entry_SYSCALL_64_fastpath+0x16/0x7a
arch/x86/entry/entry_64.S:185







BUG: KASAN: use-after-free in __lock_acquire+0x4e99/0x5100 at addr
ffff88005f77aee0
Read of size 8 by task syzkaller_execu/30337
=============================================================================
BUG kmalloc-1024 (Not tainted): kasan: bad access detected
-----------------------------------------------------------------------------

Disabling lock debugging due to kernel taint
INFO: Allocated in alloc_perf_context+0x4c/0x100 age=124 cpu=3 pid=30312
[< none >] ___slab_alloc+0x648/0x8c0 mm/slub.c:2468
[< none >] __slab_alloc+0x4c/0x90 mm/slub.c:2497
[< inline >] slab_alloc_node mm/slub.c:2560
[< inline >] slab_alloc mm/slub.c:2602
[< none >] kmem_cache_alloc_trace+0x23c/0x3f0 mm/slub.c:2619
[< inline >] kmalloc include/linux/slab.h:458
[< inline >] kzalloc include/linux/slab.h:602
[< none >] alloc_perf_context+0x4c/0x100 kernel/events/core.c:3405
[< none >] find_get_context+0x187/0x830 kernel/events/core.c:3512
[< none >] SYSC_perf_event_open+0xe50/0x21a0 kernel/events/core.c:8381
[< none >] SyS_perf_event_open+0x39/0x50 kernel/events/core.c:8242
[< none >] tracesys_phase2+0x88/0x8d arch/x86/entry/entry_64.S:269

INFO: Freed in free_ctx+0x4b/0x70 age=74 cpu=2 pid=16
[< none >] __slab_free+0x21e/0x3e0 mm/slub.c:2678
[< inline >] slab_free mm/slub.c:2833
[< none >] kfree+0x26f/0x3e0 mm/slub.c:3662
[< none >] free_ctx+0x4b/0x70 kernel/events/core.c:872
[< inline >] __rcu_reclaim kernel/rcu/rcu.h:118
[< inline >] rcu_do_batch kernel/rcu/tree.c:2693
[< inline >] invoke_rcu_callbacks kernel/rcu/tree.c:2961
[< inline >] __rcu_process_callbacks kernel/rcu/tree.c:2928
[< none >] rcu_process_callbacks+0x631/0x19e0 kernel/rcu/tree.c:2945
[< none >] __do_softirq+0x2e5/0xb40 kernel/softirq.c:273
[< none >] run_ksoftirqd+0x20/0x50 kernel/softirq.c:662
[< none >] smpboot_thread_fn+0x41d/0x850 kernel/smpboot.c:163
[< none >] kthread+0x21d/0x2e0 drivers/block/aoe/aoecmd.c:1314
[< none >] ret_from_fork+0x3f/0x70 arch/x86/entry/entry_64.S:468

INFO: Indirect in __call_rcu.constprop.63+0x7e/0xe30 age=88 cpu=2 pid=30312
[< none >] __call_rcu.constprop.63+0x7e/0xe30 kernel/rcu/tree.c:3048
[< none >] call_rcu_sched+0x26/0x30 kernel/rcu/tree.c:3117
[< none >] put_ctx+0x150/0x250 kernel/events/core.c:891
[< inline >] perf_event_exit_task_context kernel/events/core.c:8860
[< none >] perf_event_exit_task+0x706/0xae0 kernel/events/core.c:8887
[< none >] do_exit+0x892/0x3050 kernel/exit.c:759
[< none >] do_group_exit+0xec/0x390 kernel/exit.c:880
[< none >] get_signal+0x677/0x1bf0 kernel/signal.c:2307
[< none >] do_signal+0x7e/0x2170 arch/x86/kernel/signal.c:709
[< none >] exit_to_usermode_loop+0xfe/0x1e0
arch/x86/entry/common.c:247
[< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:282
[< none >] syscall_return_slowpath+0x16b/0x240
arch/x86/entry/common.c:344
[< none >] int_ret_from_sys_call+0x25/0x9f
arch/x86/entry/entry_64.S:281

INFO: Slab 0xffffea00017dde00 objects=21 used=8 fp=0xffff88005f77aec0
flags=0x5fffc0000004080
INFO: Object 0xffff88005f77aec0 @offset=11968 fp=0xffff88005f77b498
CPU: 3 PID: 30337 Comm: syzkaller_execu Tainted: G B
4.4.0-rc3+ #155
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
0000000000000003 ffff880065547680 ffffffff82e0f8b8 0000000041b58ab3
ffffffff87a9a27d ffffffff82e0f806 ffff880064972e80 ffffffff87abb3d1
ffff88003e806d00 0000000000000008 ffff88005f77aec0 ffff880065547680

Call Trace:
[< inline >] __dump_stack lib/dump_stack.c:15
[<ffffffff82e0f8b8>] dump_stack+0xb2/0xfa lib/dump_stack.c:50
[<ffffffff818369dc>] print_trailer+0x12c/0x210 mm/slub.c:659
[<ffffffff81841a3f>] object_err+0x2f/0x40 mm/slub.c:689
[< inline >] print_address_description mm/kasan/report.c:138
[<ffffffff818457a9>] kasan_report_error+0x5d9/0x860 mm/kasan/report.c:251
[< inline >] kasan_report mm/kasan/report.c:274
[<ffffffff81845bd4>] __asan_report_load8_noabort+0x54/0x70
mm/kasan/report.c:295
[<ffffffff814b25f9>] __lock_acquire+0x4e99/0x5100 kernel/locking/lockdep.c:3092
[<ffffffff814b517d>] lock_acquire+0x19d/0x3f0 kernel/locking/lockdep.c:3585
[< inline >] __raw_spin_lock include/linux/spinlock_api_smp.h:144
[<ffffffff86a89591>] _raw_spin_lock+0x31/0x40 kernel/locking/spinlock.c:151
[< inline >] perf_ctx_lock kernel/events/core.c:351
[<ffffffff816e15d9>] __perf_install_in_context+0x109/0xa00
kernel/events/core.c:2083
[<ffffffff816cb7ba>] remote_function+0x14a/0x200 kernel/events/core.c:74
[<ffffffff81572787>] generic_exec_single+0x2a7/0x490 kernel/smp.c:156
[<ffffffff81573350>] smp_call_function_single+0x200/0x310 kernel/smp.c:300
[<ffffffff816c9bd3>] task_function_call+0x123/0x160 kernel/events/core.c:101
[<ffffffff816d1bf1>] perf_install_in_context+0x201/0x340
kernel/events/core.c:2164
[<ffffffff816f62e5>] SYSC_perf_event_open+0x1465/0x21a0
kernel/events/core.c:8546
[<ffffffff816ff449>] SyS_perf_event_open+0x39/0x50 kernel/events/core.c:8242
[<ffffffff86a8a3b6>] entry_SYSCALL_64_fastpath+0x16/0x7a
arch/x86/entry/entry_64.S:185

Alexei Starovoitov

Dec 8, 2015, 2:56:28 PM
to Dmitry Vyukov, syzkaller, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
So you're saying you want to add hundreds of lines of code to the kernel
only to be used by kasan for this very specific and narrow use case,
instead of designing a generic 'stacktrace<->id' mechanism?
That's not how the kernel operates. Every kernel developer must think
about code reuse. We cannot bloat the kernel with unique hacks.

Peter Zijlstra

Dec 9, 2015, 4:05:41 AM
to Ingo Molnar, Dmitry Vyukov, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
On Tue, Dec 08, 2015 at 07:57:38PM +0100, Ingo Molnar wrote:
> Btw., could we add more redundancy / debug code to the refcounting code? It seems
> to be a frequent source of very hard to find/fix races/bugs - so it should be ripe
> for some extra debug infrastructure ...

I'll try, but it's not easy. The biggest problem so far is the
ctx::parent relation. We cannot easily track the reverse of that because
of locking (we'd need to hold two ctx->locks, and we cannot because of
perf_event_context_sched_out()).

The other two relations we have reverse maps for:

task->perf_event_ctxp[] <-> ctx->task
event->ctx <-> ctx->event_list

Also, all 3 relations are (more or less) protected under ctx->lock:

- unclone_ctx():
removes the ctx->parent link, with ctx->lock held

- perf_remove_from_context():
removes the event from ctx::event_list, with ctx->lock held;
leaves event->ctx set, because nasty games with ctx migration,
it will get set to the new context, but can never be NULL.

- perf_event_exit_task_context():
clears task->perf_event_ctxp[], with ctx->lock held;
leaves ctx->task set, still looking into this.



Dmitry Vyukov

Dec 9, 2015, 4:17:37 AM
to Alexei Starovoitov, syzkaller, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
On Tue, Dec 8, 2015 at 8:56 PM, Alexei Starovoitov
We would happily share this code with other subsystems, or even better
reuse an existing solution. But to the best of my knowledge there is
no such existing solution, and I still know basically nothing about
what you were hacking, why, and what your requirements are.
Our requirements are:
- map stack trace to 4-byte id
- relatively memory efficient for storing of ~100K traces
- the only performance-critical operation is mapping of an already
existing stack (in particular no atomic RMW on this path)
Non-requirements:
- any user-space interface
- removal of stack traces (they all will be reused in future)
We plan to use it in KASAN, KTSAN (which already uses it), and KMSAN.
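
[To make these requirements concrete, here is a minimal userspace-flavoured
sketch of such a depot: an append-only hash table in which looking up an
already-stored stack needs only plain reads, and only first-time insertion
needs synchronization (elided below). All names are illustrative; the
implementation that eventually landed as lib/stackdepot.c differs in detail.]

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEPOT_BUCKETS (1 << 16)

struct depot_rec {
	struct depot_rec *next;		/* bucket chain, append-only */
	uint32_t id;
	uint32_t hash;
	int nr_frames;
	unsigned long frames[];		/* the stored program counters */
};

static struct depot_rec *buckets[DEPOT_BUCKETS];
static uint32_t next_id = 1;		/* 0 is reserved for "no stack" */

/* FNV-1a over the raw PCs; any decent hash would do. */
static uint32_t hash_stack(const unsigned long *frames, int n)
{
	uint32_t h = 2166136261u;
	int i;

	for (i = 0; i < n; i++)
		h = (h ^ (uint32_t)frames[i]) * 16777619u;
	return h;
}

/* Return a stable 4-byte id for a stack, allocating on first sight.
 * The hot path (stack already present) is plain reads only: records are
 * immutable once linked in, so no atomic RMW is needed to find them.
 * Insertion would be serialized by a lock and published with a release
 * store in real code; both are elided here for brevity. */
uint32_t depot_save(const unsigned long *frames, int n)
{
	uint32_t h = hash_stack(frames, n);
	struct depot_rec **bucket = &buckets[h % DEPOT_BUCKETS];
	struct depot_rec *r;

	for (r = *bucket; r; r = r->next)
		if (r->hash == h && r->nr_frames == n &&
		    !memcmp(r->frames, frames, n * sizeof(*frames)))
			return r->id;

	r = malloc(sizeof(*r) + n * sizeof(*frames));
	if (!r)
		return 0;
	r->id = next_id++;
	r->hash = h;
	r->nr_frames = n;
	memcpy(r->frames, frames, n * sizeof(*frames));
	r->next = *bucket;
	*bucket = r;			/* publish */
	return r->id;
}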

Alexei Starovoitov

Dec 9, 2015, 10:54:39 PM
to Dmitry Vyukov, syzkaller, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
On Wed, Dec 09, 2015 at 10:17:17AM +0100, Dmitry Vyukov wrote:
>
> We would happily share this code with other subsystems, or even better
> reuse an existing solution. But to the best of my knowledge there is
> no such existing solution, and I still know basically nothing about
> what you were hacking, why, and what your requirements are.

I need a 'stacktrace <-> id' facility to be able to do map[stack]++,
where the stack is either a kernel stack, a user stack, or a combination
of both. The number of stacks can be large, so this 'stack depot' needs
to be limited either by total memory or by number of stacks.
Stacks also need a configurable depth limit to improve lookup time,
save memory, etc.
Once the program is done, the stack depot needs to be freed.
The user-interface bit of translating an id back to a stack is TBD.

> Our requirements are:
> - map stack trace to 4-byte id
> - relatively memory efficient for storing of ~100K traces
> - the only performance-critical operation is mapping of an already
> existing stack (in particular no atomic RMW on this path)
> Non-requirements:
> - any user-space interface
> - removal of stack traces (they all will be reused in future)

Freeing memory is a requirement regardless.
Even when the kernel is running with kasan, there must be a way to stop
stack collection and free that memory.
You cannot treat the kernel as your test program or 'device under test'.
It's the other way around: kasan must co-exist nicely with the kernel.
Right now it adds some constant overhead, fine, but saying 'I'll just
keep taking memory from the kernel for my own kasan needs' is not ok.

Peter Zijlstra

Dec 10, 2015, 4:02:54 AM
to Alexei Starovoitov, Dmitry Vyukov, syzkaller, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
On Wed, Dec 09, 2015 at 07:54:35PM -0800, Alexei Starovoitov wrote:
> Freeing memory is a requirement regardless.
> Even when the kernel is running with kasan, there must be a way to stop
> stack collection and free that memory.
> You cannot treat the kernel as your test program or 'device under test'.

Relax, that is exactly what lockdep does. It cannot dynamically allocate
things because allocators use locks, etc.

It's fine to build up state for debug bits, especially if it's bounded,
like the number of unique callchains.

Alexei Starovoitov

Dec 10, 2015, 12:03:10 PM
to Peter Zijlstra, Dmitry Vyukov, syzkaller, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
Except that the code in question is doing unbounded alloc_pages().

Peter Zijlstra

Dec 10, 2015, 2:57:43 PM
to Dmitry Vyukov, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
On Tue, Dec 08, 2015 at 08:14:58PM +0100, Dmitry Vyukov wrote:
> Tested with your patches.
> The additional WARNING does not fire.
> For the rcu stacks, I had to change two more 2's to TRACK_NR and also
> move the memorization from call_rcu to __call_rcu, but now it is working.
> Two reports with indirect stack:

Ah nice, and sorry for the oversights. Obviously one can add this
indirect marker to __queue_work() as well to also cover the workqueue
offloading some other sites do.


> [<ffffffff81845bd4>] __asan_report_load8_noabort+0x54/0x70 mm/kasan/report.c:295
> [<ffffffff814b25f9>] __lock_acquire+0x4e99/0x5100 kernel/locking/lockdep.c:3092
> [<ffffffff814b517d>] lock_acquire+0x19d/0x3f0 kernel/locking/lockdep.c:3585
> [< inline >] __raw_spin_lock include/linux/spinlock_api_smp.h:144
> [<ffffffff86a89591>] _raw_spin_lock+0x31/0x40 kernel/locking/spinlock.c:151
> [< inline >] perf_ctx_lock kernel/events/core.c:351

So this here takes both cpuctx->ctx.lock and cpuctx->task_ctx->lock,
and it is the latter that is pointing out into space.

> [<ffffffff816e15d9>] __perf_install_in_context+0x109/0xa00 kernel/events/core.c:2083
> [<ffffffff816cb7ba>] remote_function+0x14a/0x200 kernel/events/core.c:74
> [<ffffffff81572787>] generic_exec_single+0x2a7/0x490 kernel/smp.c:156
> [<ffffffff81573350>] smp_call_function_single+0x200/0x310 kernel/smp.c:300
> [<ffffffff816c9bd3>] task_function_call+0x123/0x160 kernel/events/core.c:101
> [<ffffffff816d1bf1>] perf_install_in_context+0x201/0x340 kernel/events/core.c:2164
> [<ffffffff816f62e5>] SYSC_perf_event_open+0x1465/0x21a0 kernel/events/core.c:8546
> [<ffffffff816ff449>] SyS_perf_event_open+0x39/0x50 kernel/events/core.c:8242
> [<ffffffff86a8a3b6>] entry_SYSCALL_64_fastpath+0x16/0x7a arch/x86/entry/entry_64.S:185
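
[For readers following the trace: perf_ctx_lock(), referenced above at
kernel/events/core.c:351, looked essentially like this at the time. This
is a paraphrase for illustration, not a verbatim copy.]

static void perf_ctx_lock(struct perf_cpu_context *cpuctx,
			  struct perf_event_context *ctx)
{
	raw_spin_lock(&cpuctx->ctx.lock);
	if (ctx)
		raw_spin_lock(&ctx->lock);	/* here ctx is the stale,
						 * already-freed task_ctx */
}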


Now I managed to produce a related WARN:

[ 2338.884942] ------------[ cut here ]------------
[ 2338.890112] WARNING: CPU: 13 PID: 35162 at ../kernel/events/core.c:2702 task_ctx_sched_out+0x6b/0x80()
[ 2338.900504] Modules linked in:
[ 2338.903933] CPU: 13 PID: 35162 Comm: bash Not tainted 4.4.0-rc4-dirty #244
[ 2338.911610] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
[ 2338.923071] ffffffff81f1468e ffff8807c6457cb8 ffffffff815c680c 0000000000000000
[ 2338.931382] ffff8807c6457cf0 ffffffff810c8a56 ffffe8ffff8c1bd0 ffff8808132ed400
[ 2338.939678] 0000000000000286 ffff880813170380 ffff8808132ed400 ffff8807c6457d00
[ 2338.947987] Call Trace:
[ 2338.950726] [<ffffffff815c680c>] dump_stack+0x4e/0x82
[ 2338.956474] [<ffffffff810c8a56>] warn_slowpath_common+0x86/0xc0
[ 2338.963195] [<ffffffff810c8b4a>] warn_slowpath_null+0x1a/0x20
[ 2338.969720] [<ffffffff811a49cb>] task_ctx_sched_out+0x6b/0x80
[ 2338.976244] [<ffffffff811a62d2>] perf_event_exec+0xe2/0x180
[ 2338.982575] [<ffffffff8121fb6f>] setup_new_exec+0x6f/0x1b0
[ 2338.988810] [<ffffffff8126de83>] load_elf_binary+0x393/0x1660
[ 2338.995339] [<ffffffff811dc772>] ? get_user_pages+0x52/0x60
[ 2339.001669] [<ffffffff8121e297>] search_binary_handler+0x97/0x200
[ 2339.008581] [<ffffffff8121f8b3>] do_execveat_common.isra.33+0x543/0x6e0
[ 2339.016072] [<ffffffff8121fcea>] SyS_execve+0x3a/0x50
[ 2339.021819] [<ffffffff819fc165>] stub_execve+0x5/0x5
[ 2339.027469] [<ffffffff819fbeb2>] ? entry_SYSCALL_64_fastpath+0x12/0x71
[ 2339.034860] ---[ end trace ee1337c59a0ddeac ]---


Which is an existing WARN_ON_ONCE() indicating that cpuctx->task_ctx is
not what we expected it to be.

The patch below appears to have cured it (it's so far managed to run
longer than previous attempts, but I'm sure it'll explode the moment
I've sent this email).

I'm not sure I can explain your problem with this, but I figure it's
worth a try.


---
kernel/events/core.c | 15 +++++----------
1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index c3d61b92d805..d5293325d8c5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3136,15 +3136,16 @@ static int event_enable_on_exec(struct perf_event *event,
* Enable all of a task's events that have been marked enable-on-exec.
* This expects task == current.
*/
-static void perf_event_enable_on_exec(struct perf_event_context *ctx)
+static void perf_event_enable_on_exec(int ctxn)
{
- struct perf_event_context *clone_ctx = NULL;
+ struct perf_event_context *ctx, *clone_ctx = NULL;
struct perf_event *event;
unsigned long flags;
int enabled = 0;
int ret;

local_irq_save(flags);
+ ctx = current->perf_event_ctxp[ctxn];
if (!ctx || !ctx->nr_events)
goto out;

@@ -3187,17 +3188,11 @@ static void perf_event_enable_on_exec(struct perf_event_context *ctx)

void perf_event_exec(void)
{
- struct perf_event_context *ctx;
int ctxn;

rcu_read_lock();
- for_each_task_context_nr(ctxn) {
- ctx = current->perf_event_ctxp[ctxn];
- if (!ctx)
- continue;
-
- perf_event_enable_on_exec(ctx);
- }
+ for_each_task_context_nr(ctxn)
+ perf_event_enable_on_exec(ctxn);
rcu_read_unlock();
}
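
[An editorial annotation of the race this patch closes, reconstructed from
the commit message that appears later in the thread: the per-task context
pointers can be swapped at context-switch time, so the pointer must only
be loaded once preemption is off.]

/*
 *   exec path                            scheduler (preemption)
 *   ---------                            ----------------------
 *   ctx = current->perf_event_ctxp[n];
 *                                        context switch: the optimized
 *                                        switch path swaps the cloned
 *                                        perf_event_ctxp[] pointers of
 *                                        the two tasks
 *   local_irq_save(flags);
 *   ... operates on the stale ctx ...    -> WARN / use-after-free
 *
 * Loading the pointer only after interrupts are disabled closes the window.
 */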

Ingo Molnar

Dec 11, 2015, 3:14:39 AM
to Alexei Starovoitov, Peter Zijlstra, Dmitry Vyukov, syzkaller, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
Yes, but the trick is to still have a bounded-size debug pool - one which
runs out of entries gracefully.

Which in practice is plenty enough for most types of testing, and is a lot more
robust than any dynamic scheme.

Thanks,

Ingo

Dmitry Vyukov

Dec 15, 2015, 8:09:50 AM
to syzkaller, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
Hi Peter,

I've applied your patch. I have not seen the crashes since then. But
I've started seeing lots of stalls/hangs; see the other email titled
"perf: stalls in perf_install_in_context/perf_remove_from_context". I
am able to reproduce the stalls on stock
6764e5ebd5c62236d082f9ae030674467d0b2779 without any patches on top,
so the stalls are not related to your patch.
I am not sure why I started seeing the stalls only now. It is either a
recent change in the tree, or maybe the fuzzer figured out how to trigger
the stall and started doing it frequently, or maybe your fix for the
crashes allowed the fuzzer to drill deeper into perf... or maybe I just
missed stall reports among the numerous crash reports. Anyway, testing
is kind of inconclusive now. Please take a look at the stall issue.

Thanks

Dmitry Vyukov

Dec 15, 2015, 8:11:33 AM
to Ingo Molnar, Alexei Starovoitov, Peter Zijlstra, syzkaller, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
A hard upper bound on consumed memory would work for us without
introducing any slowdown and without increasing code complexity. So it
sounds good to me.

Peter Zijlstra

Dec 17, 2015, 9:06:18 AM
to Dmitry Vyukov, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
On Thu, Dec 10, 2015 at 08:57:40PM +0100, Peter Zijlstra wrote:
> I'm not sure I can explain your problem with this, but I figure its
> worth a try.

Did it make a difference?

Also, does the syzkaller thing do cpu-hotplug during its runs?

Dmitry Vyukov

Dec 17, 2015, 9:09:10 AM
to Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
On Thu, Dec 17, 2015 at 3:06 PM, Peter Zijlstra <pet...@infradead.org> wrote:
> On Thu, Dec 10, 2015 at 08:57:40PM +0100, Peter Zijlstra wrote:
>> I'm not sure I can explain your problem with this, but I figure its
>> worth a try.
>
> Did it make a difference?
>
> Also, does the syzkaller thing do cpu-hotplug during its runs?

See my email on Dec 15.
I have CONFIG_HOTPLUG enabled, but I don't do actual hotplug (well,
actually I don't know how to do hotplug, but I still think I am not
doing it).

Peter Zijlstra

Dec 17, 2015, 9:26:22 AM
to Dmitry Vyukov, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
On Thu, Dec 17, 2015 at 03:08:49PM +0100, Dmitry Vyukov wrote:
> On Thu, Dec 17, 2015 at 3:06 PM, Peter Zijlstra <pet...@infradead.org> wrote:
> > On Thu, Dec 10, 2015 at 08:57:40PM +0100, Peter Zijlstra wrote:
> >> I'm not sure I can explain your problem with this, but I figure its
> >> worth a try.
> >
> > Did it make a difference?
> >
> > Also, does the syzkaller thing do cpu-hotplug during its runs?
>
> See my email on Dec 15.

I have 4 emails from you on that day (and one private), they pertain to
getting stuck in perf_{install,remove}_from_context(). None of those
emails seem to contain an answer to either of the two above questions.

CACT4Y+Za4L4q6FZLXjyGbnSY...@mail.gmail.com
CACT4Y+YUkYgCuQFW6eBTtdfR...@mail.gmail.com
CACT4Y+bvHWuyDZjoFhLAOiK1rfNQcfuSJCY0oS5xA=u8j0...@mail.gmail.com
CACT4Y+an67KnTsVVzYVvrYKW...@mail.gmail.com

and in private:

CACT4Y+Zb1vU192JbCnyS9qY3...@mail.gmail.com

Did I miss an email?

This is the thread about the difficult to deterministically reproduce
use-after-free.

> I have CONFIG_HOTPLUG enabled, but I don't do actual hotplug (well,
> actually I don't know how to do hotplug, but I still think I am not
> doing it).

You can do hotplug like:

echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/online

so if syzkaller doesn't know about that, and I suppose you not
knowing this means syzkaller doesn't either, this isn't happening.

Thanks, that disqualifies a bunch of code.

Peter Zijlstra

Dec 17, 2015, 9:28:27 AM
to Dmitry Vyukov, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
On Thu, Dec 17, 2015 at 03:26:17PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 17, 2015 at 03:08:49PM +0100, Dmitry Vyukov wrote:
> > On Thu, Dec 17, 2015 at 3:06 PM, Peter Zijlstra <pet...@infradead.org> wrote:
> > > On Thu, Dec 10, 2015 at 08:57:40PM +0100, Peter Zijlstra wrote:
> > >> I'm not sure I can explain your problem with this, but I figure its
> > >> worth a try.
> > >
> > > Did it make a difference?
> > >
> > > Also, does the syzkaller thing do cpu-hotplug during its runs?
> >
> > See my email on Dec 15.
>
> I have 4 emails from you on that day (and one private), they pertain to
> getting stuck in perf_{install,remove}_from_context(). None of those
> emails seem to contain an answer to either of the two above questions.
>
> CACT4Y+Za4L4q6FZLXjyGbnSY...@mail.gmail.com
> CACT4Y+YUkYgCuQFW6eBTtdfR...@mail.gmail.com
> CACT4Y+bvHWuyDZjoFhLAOiK1rfNQcfuSJCY0oS5xA=u8j0...@mail.gmail.com
> CACT4Y+an67KnTsVVzYVvrYKW...@mail.gmail.com
>
> and in private:
>
> CACT4Y+Zb1vU192JbCnyS9qY3...@mail.gmail.com
>
> Did I miss an email?

Yes, I seem to have missed this one:

CACT4Y+Y_YdEd-YGdWq-wQV6S...@mail.gmail.com

found it, thanks!

Dmitry Vyukov

Dec 17, 2015, 9:35:53 AM
to Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
On Thu, Dec 17, 2015 at 3:28 PM, Peter Zijlstra <pet...@infradead.org> wrote:
>> > >> I'm not sure I can explain your problem with this, but I figure its
>> > >> worth a try.
>> > >
>> > > Did it make a difference?
>> > >
>> > > Also, does the syzkaller thing do cpu-hotplug during its runs?
>> >
>> > See my email on Dec 15.
>>
>> I have 4 emails from you on that day (and one private), they pertain to
>> getting stuck in perf_{install,remove}_from_context(). None of those
>> emails seem to contain an answer to either of the two above questions.
>>
>> CACT4Y+Za4L4q6FZLXjyGbnSY...@mail.gmail.com
>> CACT4Y+YUkYgCuQFW6eBTtdfR...@mail.gmail.com
>> CACT4Y+bvHWuyDZjoFhLAOiK1rfNQcfuSJCY0oS5xA=u8j0...@mail.gmail.com
>> CACT4Y+an67KnTsVVzYVvrYKW...@mail.gmail.com
>>
>> and in private:
>>
>> CACT4Y+Zb1vU192JbCnyS9qY3...@mail.gmail.com
>>
>> Did I miss an email?
>
> Yes, I seem to have missed this one:
>
> CACT4Y+Y_YdEd-YGdWq-wQV6S...@mail.gmail.com
>
> found it, thanks!

Yes, that one.
In short, I did not see use-after-frees but perf_event_open fuzzing
started hanging VMs very frequently, so testing is inconclusive.

And, no, nobody touches anything in /sys/devices.

Peter Zijlstra

Dec 17, 2015, 9:43:41 AM
to Dmitry Vyukov, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
On Thu, Dec 17, 2015 at 03:35:32PM +0100, Dmitry Vyukov wrote:
> In short, I did not see use-after-frees but perf_event_open fuzzing
> started hanging VMs very frequently, so testing is inconclusive.

Right, I'll get back to staring at that one.

Running your test case on bare metal got my machine reproducibly stuck,
but in a different place than you've reported.

It takes a while too; so far no luck in explaining/debugging it. All my
added debug code remains silent :/

I'll go think up more...

Dmitry Vyukov

Dec 31, 2015, 12:16:02 PM
to Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
Hi Peter,

Any progress with this?
I can't continue to test the perf subsystem, as it reliably hangs my machines.

Peter Zijlstra

Jan 5, 2016, 7:17:30 AM
to Dmitry Vyukov, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
Holidays got in the way; I'll continue ASAP once I've cleared some of
the easy mail backlog from the past two weeks.

tip-bot for Peter Zijlstra

Jan 6, 2016, 1:47:01 PM
to linux-ti...@vger.kernel.org, k...@google.com, gli...@google.com, torv...@linux-foundation.org, dvy...@google.com, sasha...@oracle.com, pet...@infradead.org, syzk...@googlegroups.com, edum...@google.com, linux-...@vger.kernel.org, ac...@redhat.com, vincent...@maine.edu, tg...@linutronix.de, era...@google.com, ac...@kernel.org, h...@zytor.com, mi...@kernel.org, jo...@redhat.com
Commit-ID: c127449944659543e5e2423002f08f0af98dba5c
Gitweb: http://git.kernel.org/tip/c127449944659543e5e2423002f08f0af98dba5c
Author: Peter Zijlstra <pet...@infradead.org>
AuthorDate: Thu, 10 Dec 2015 20:57:40 +0100
Committer: Ingo Molnar <mi...@kernel.org>
CommitDate: Wed, 6 Jan 2016 10:52:38 +0100

perf: Fix race in perf_event_exec()

I managed to tickle this warning:

[ 2338.884942] ------------[ cut here ]------------
[ 2338.890112] WARNING: CPU: 13 PID: 35162 at ../kernel/events/core.c:2702 task_ctx_sched_out+0x6b/0x80()
[ 2338.900504] Modules linked in:
[ 2338.903933] CPU: 13 PID: 35162 Comm: bash Not tainted 4.4.0-rc4-dirty #244
[ 2338.911610] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
[ 2338.923071] ffffffff81f1468e ffff8807c6457cb8 ffffffff815c680c 0000000000000000
[ 2338.931382] ffff8807c6457cf0 ffffffff810c8a56 ffffe8ffff8c1bd0 ffff8808132ed400
[ 2338.939678] 0000000000000286 ffff880813170380 ffff8808132ed400 ffff8807c6457d00
[ 2338.947987] Call Trace:
[ 2338.950726] [<ffffffff815c680c>] dump_stack+0x4e/0x82
[ 2338.956474] [<ffffffff810c8a56>] warn_slowpath_common+0x86/0xc0
[ 2338.963195] [<ffffffff810c8b4a>] warn_slowpath_null+0x1a/0x20
[ 2338.969720] [<ffffffff811a49cb>] task_ctx_sched_out+0x6b/0x80
[ 2338.976244] [<ffffffff811a62d2>] perf_event_exec+0xe2/0x180
[ 2338.982575] [<ffffffff8121fb6f>] setup_new_exec+0x6f/0x1b0
[ 2338.988810] [<ffffffff8126de83>] load_elf_binary+0x393/0x1660
[ 2338.995339] [<ffffffff811dc772>] ? get_user_pages+0x52/0x60
[ 2339.001669] [<ffffffff8121e297>] search_binary_handler+0x97/0x200
[ 2339.008581] [<ffffffff8121f8b3>] do_execveat_common.isra.33+0x543/0x6e0
[ 2339.016072] [<ffffffff8121fcea>] SyS_execve+0x3a/0x50
[ 2339.021819] [<ffffffff819fc165>] stub_execve+0x5/0x5
[ 2339.027469] [<ffffffff819fbeb2>] ? entry_SYSCALL_64_fastpath+0x12/0x71
[ 2339.034860] ---[ end trace ee1337c59a0ddeac ]---

Which is a WARN_ON_ONCE() indicating that cpuctx->task_ctx is not
what we expected it to be.

This is because context switches can swap the task_struct::perf_event_ctxp[]
pointer around. Therefore you have to either disable preemption when looking
at current, or hold ctx->lock.

Fix perf_event_enable_on_exec(): it loads current->perf_event_ctxp[]
before disabling interrupts, so a preemption in the right place
can swap contexts around, leaving us using the wrong one.
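
In sketch form (a condensed illustration only; the actual change is the
diff below):

	/* Before: ctx is sampled while still preemptible. */
	ctx = current->perf_event_ctxp[ctxn];	/* may be swapped right here */
	local_irq_save(flags);			/* too late to pin current */

	/* After: disable IRQs first, then sample the pointer. */
	local_irq_save(flags);
	ctx = current->perf_event_ctxp[ctxn];	/* stable until irq_restore */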

Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Cc: Alexander Potapenko <gli...@google.com>
Cc: Arnaldo Carvalho de Melo <ac...@kernel.org>
Cc: Arnaldo Carvalho de Melo <ac...@redhat.com>
Cc: Dmitry Vyukov <dvy...@google.com>
Cc: Eric Dumazet <edum...@google.com>
Cc: Jiri Olsa <jo...@redhat.com>
Cc: Kostya Serebryany <k...@google.com>
Cc: Linus Torvalds <torv...@linux-foundation.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Sasha Levin <sasha...@oracle.com>
Cc: Stephane Eranian <era...@google.com>
Cc: Thomas Gleixner <tg...@linutronix.de>
Cc: Vince Weaver <vincent...@maine.edu>
Cc: syzkaller <syzk...@googlegroups.com>
Link: http://lkml.kernel.org/r/2015121019...@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
kernel/events/core.c | 15 +++++----------
1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 39cf4a4..fd7de04 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3154,15 +3154,16 @@ static int event_enable_on_exec(struct perf_event *event,
* Enable all of a task's events that have been marked enable-on-exec.
* This expects task == current.
*/
-static void perf_event_enable_on_exec(struct perf_event_context *ctx)
+static void perf_event_enable_on_exec(int ctxn)
{
- struct perf_event_context *clone_ctx = NULL;
+ struct perf_event_context *ctx, *clone_ctx = NULL;
struct perf_event *event;
unsigned long flags;
int enabled = 0;
int ret;

local_irq_save(flags);
+ ctx = current->perf_event_ctxp[ctxn];
if (!ctx || !ctx->nr_events)
goto out;

@@ -3205,17 +3206,11 @@ out:

Eric Dumazet

unread,
Jan 6, 2016, 1:56:57 PM1/6/16
to syzkaller, Eric Dumazet, LKML, Peter Zijlstra, ac...@redhat.com, Linus Torvalds, Alexander Potapenko, Kostya Serebryany, Dmitry Vyukov, Sasha Levin, H. Peter Anvin, jo...@redhat.com, Ingo Molnar, vincent...@maine.edu, Thomas Gleixner, ac...@kernel.org, Stephane Eranian, linux-ti...@vger.kernel.org
On Wed, Jan 6, 2016 at 1:46 PM, tip-bot for Peter Zijlstra
<tip...@zytor.com> wrote:

>
> This is because context switches can swap the task_struct::perf_event_ctxp[]
> pointer around. Therefore you have to either disable preemption when looking
> at current, or hold ctx->lock.
>


>
> void perf_event_exec(void)
> {
> - struct perf_event_context *ctx;
> int ctxn;
>
> rcu_read_lock();

Do we still need this rcu_read_lock(), if perf_event_enable_on_exec()
uses local_irq_save()?

Peter Zijlstra

unread,
Jan 7, 2016, 8:40:14 AM1/7/16
to Eric Dumazet, syzkaller, LKML, ac...@redhat.com, Linus Torvalds, Alexander Potapenko, Kostya Serebryany, Dmitry Vyukov, Sasha Levin, H. Peter Anvin, jo...@redhat.com, Ingo Molnar, vincent...@maine.edu, Thomas Gleixner, ac...@kernel.org, Stephane Eranian, linux-ti...@vger.kernel.org, Paul McKenney
On Wed, Jan 06, 2016 at 01:56:56PM -0500, Eric Dumazet wrote:
> On Wed, Jan 6, 2016 at 1:46 PM, tip-bot for Peter Zijlstra
> <tip...@zytor.com> wrote:
>
> >
> > This is because context switches can swap the task_struct::perf_event_ctxp[]
> > pointer around. Therefore you have to either disable preemption when looking
> > at current, or hold ctx->lock.
> >
>
>
> >
> > void perf_event_exec(void)
> > {
> > - struct perf_event_context *ctx;
> > int ctxn;
> >
> > rcu_read_lock();
>
> Do we still need this rcu_read_lock(), if perf_event_enable_on_exec()
> uses local_irq_save()?

Strictly speaking we should not rely on the fact that RCU grace periods
do not progress with IRQs disabled, so yes.
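
To spell out the flavor pairing (an illustrative sketch, not the actual
kernel source): free_ctx() is queued with call_rcu(), i.e. the plain
RCU flavor, and an IRQ-disabled region is only guaranteed to be a
read-side critical section for the sched flavor, hence:

	rcu_read_lock();		/* pins objects freed via call_rcu() */
	local_irq_save(flags);		/* pins current against preemption */
	ctx = current->perf_event_ctxp[ctxn];
	/* ... */
	local_irq_restore(flags);
	rcu_read_unlock();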

Paul E. McKenney

unread,
Jan 7, 2016, 11:26:01 AM1/7/16
to Peter Zijlstra, Eric Dumazet, syzkaller, LKML, ac...@redhat.com, Linus Torvalds, Alexander Potapenko, Kostya Serebryany, Dmitry Vyukov, Sasha Levin, H. Peter Anvin, jo...@redhat.com, Ingo Molnar, vincent...@maine.edu, Thomas Gleixner, ac...@kernel.org, Stephane Eranian, linux-ti...@vger.kernel.org
What Peter said!

The current implementation would let you get away with IRQs disabled
(aside from lockdep splats), but it is easy to imagine implementations
that do not, hence the restriction.

Are the extra rcu_read_lock() and rcu_read_unlock() calls causing a performance
problem? If so, one option would be to use synchronize_sched() and
call_rcu_sched(), which explicitly recognize disabled IRQs as read-side
critical sections. Of course, this might well put you in a position
where you would like to use IRQ disabling in some cases, BH disabling
(or softirq context) in others, and rcu_read_lock() in yet other cases.

If so, the good news is that RCU actually now supports this efficiently.
There is the shiny new synchronize_rcu_mult() macro that can be used
as follows to wait for all three types of grace periods:

synchronize_rcu_mult(call_rcu, call_rcu_bh, call_rcu_sched);

This would wait concurrently for all three grace periods to elapse.
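
For example (a hypothetical updater; struct foo and retire_foo() are
made up for illustration):

	static void retire_foo(struct foo *p)
	{
		list_del_rcu(&p->list);
		/* One call covers rcu, rcu_bh and rcu_sched readers. */
		synchronize_rcu_mult(call_rcu, call_rcu_bh, call_rcu_sched);
		kfree(p);
	}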

Thanx, Paul

Eric Dumazet

unread,
Jan 7, 2016, 11:36:22 AM1/7/16
to Paul McKenney, Peter Zijlstra, syzkaller, LKML, ac...@redhat.com, Linus Torvalds, Alexander Potapenko, Kostya Serebryany, Dmitry Vyukov, Sasha Levin, H. Peter Anvin, jo...@redhat.com, Ingo Molnar, vincent...@maine.edu, Thomas Gleixner, ac...@kernel.org, Stephane Eranian, linux-ti...@vger.kernel.org
I guess I could not see why the rcu_read_lock() was needed around this loop.

rcu_read_lock();
for ((ctxn) = 0; (ctxn) < perf_nr_task_contexts; (ctxn)++)
perf_event_enable_on_exec(ctxn);
rcu_read_unlock();

But apparently it is needed in perf_event_enable_on_exec(), and hoisting it
avoids calling rcu_read_lock() and rcu_read_unlock() (perf_nr_task_contexts - 1) times.
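
Spelled out, the non-hoisted alternative would be (just a sketch):

	/* One read-side critical section per context: */
	for (ctxn = 0; ctxn < perf_nr_task_contexts; ctxn++) {
		rcu_read_lock();
		perf_event_enable_on_exec(ctxn);
		rcu_read_unlock();
	}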

> If so, the good news is that RCU actually now supports this efficiently.
> There is the shiny new synchronize_rcu_mult() macro that can be used
> as follows to wait for all three types of grace periods:
>
> synchronize_rcu_mult(call_rcu, call_rcu_bh, call_rcu_sched);
>
> This would wait concurrently for all three grace periods to elapse.
>

Interesting ;)

Paul E. McKenney

unread,
Jan 7, 2016, 11:46:24 AM1/7/16
to Eric Dumazet, Peter Zijlstra, syzkaller, LKML, ac...@redhat.com, Linus Torvalds, Alexander Potapenko, Kostya Serebryany, Dmitry Vyukov, Sasha Levin, H. Peter Anvin, jo...@redhat.com, Ingo Molnar, vincent...@maine.edu, Thomas Gleixner, ac...@kernel.org, Stephane Eranian, linux-ti...@vger.kernel.org
If the overhead of rcu_read_lock() and rcu_read_unlock() really is
a noticeable fraction of the loop overhead, and if perf_nr_task_contexts
is large, then you might well have a problem. I still suggest actually
measuring it, but you know me. ;-)

> > If so, the good news is that RCU actually now supports this efficiently.
> > There is the shiny new synchronize_rcu_mult() macro that can be used
> > as follows to wait for all three types of grace periods:
> >
> > synchronize_rcu_mult(call_rcu, call_rcu_bh, call_rcu_sched);
> >
> > This would wait concurrently for all three grace periods to elapse.
>
> Interesting ;)

;-) ;-) ;-)

Thanx, Paul

Dmitry Vyukov

unread,
Jan 8, 2016, 3:40:29 AM1/8/16
to Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
Hi Peter,

I've hit the following WARNING twice during an overnight run. This is on
b06f3a168cdcd80026276898fd1fee443ef25743 (Jan 6) + the following patch
from you. I can't easily reproduce it. Does it say anything to you?
It looks like it all happens in the context of a single perf_event_open
call.


------------[ cut here ]------------
WARNING: CPU: 2 PID: 2523 at kernel/events/core.c:1891
event_sched_in.isra.95+0x84b/0xcd0()
Modules linked in:
CPU: 2 PID: 2523 Comm: syz-executor Not tainted 4.4.0-rc8+ #206
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
00000000ffffffff ffff880042d7f7d8 ffffffff8289a56d 0000000000000000
ffff8800590b9780 ffffffff85db2300 ffff880042d7f818 ffffffff812eb849
ffffffff815ae34b ffffffff85db2300 0000000000000763 ffff88005f35f4e0
Call Trace:
[< inline >] __dump_stack lib/dump_stack.c:15
[<ffffffff8289a56d>] dump_stack+0x6f/0xa2 lib/dump_stack.c:50
[<ffffffff812eb849>] warn_slowpath_common+0xd9/0x140 kernel/panic.c:460
[<ffffffff812eba79>] warn_slowpath_null+0x29/0x30 kernel/panic.c:493
[<ffffffff815ae34b>] event_sched_in.isra.95+0x84b/0xcd0
kernel/events/core.c:1891
[<ffffffff815b4164>] group_sched_in+0x114/0x410 kernel/events/core.c:1963
[< inline >] ctx_pinned_sched_in kernel/events/core.c:2744
[<ffffffff815b5085>] ctx_sched_in+0xc25/0x11b0 kernel/events/core.c:2807
[<ffffffff815b5657>] perf_event_sched_in+0x47/0xa0 kernel/events/core.c:2073
[<ffffffff815b587c>] perf_event_context_sched_in+0x1cc/0x380
kernel/events/core.c:2844
[<ffffffff815bf4f3>] __perf_event_task_sched_in+0x3a3/0x480
kernel/events/core.c:2872
[< inline >] perf_event_task_sched_in include/linux/perf_event.h:931
[<ffffffff8136d40f>] finish_task_switch+0x3cf/0x610 kernel/sched/core.c:2576
[<ffffffff85c7b08d>] __schedule+0x90
2016/01/07 20:52:38 executing program 2:
[<ffffffff85c7cbc2>] preempt_schedule_common+0x42/0x70 kernel/sched/core.c:3250
[<ffffffff85c7cc07>] _cond_resched+0x17/0x20 kernel/sched/core.c:4618
[< inline >] slab_pre_alloc_hook mm/slub.c:1317
[< inline >] slab_alloc_node mm/slub.c:2520
[< inline >] slab_alloc mm/slub.c:2602
[<ffffffff816cba4b>] kmem_cache_alloc_trace+0x24b/0x2f0 mm/slub.c:2619
[< inline >] kmalloc include/linux/slab.h:458
[< inline >] kzalloc include/linux/slab.h:602
[< inline >] aa_alloc_file_context
security/apparmor/include/context.h:46
[<ffffffff826a9c85>] apparmor_file_alloc_security+0x45/0xa0
security/apparmor/lsm.c:402
[<ffffffff825e2a1d>] security_file_alloc+0x6d/0xa0 security/security.c:742
[<ffffffff81716aba>] get_empty_filp+0x13a/0x3f0 fs/file_table.c:128
[<ffffffff81716d90>] alloc_file+0x20/0x380 fs/file_table.c:163
[<ffffffff817f4fdd>] anon_inode_getfile+0x19d/0x340 fs/anon_inodes.c:105
[<ffffffff815c7daa>] SYSC_perf_event_open+0x12da/0x1fa0
kernel/events/core.c:8471
[<ffffffff815cf609>] SyS_perf_event_open+0x39/0x50 kernel/events/core.c:8262
[<ffffffff85c8a2b6>] entry_SYSCALL_64_fastpath+0x16/0x7a
arch/x86/entry/entry_64.S:185
---[ end trace 326ea08cc5f1b454 ]---


diff --git a/kernel/events/core.c b/kernel/events/core.c
index ef2d6ea..fdf78fb 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -881,13 +881,22 @@ static void free_ctx(struct rcu_head *head)
kfree(ctx);
}

+#define for_each_task_context_nr(ctxn) \
+ for ((ctxn) = 0; (ctxn) < perf_nr_task_contexts; (ctxn)++)
+
static void put_ctx(struct perf_event_context *ctx)
{
if (atomic_dec_and_test(&ctx->refcount)) {
if (ctx->parent_ctx)
put_ctx(ctx->parent_ctx);
- if (ctx->task)
+ if (ctx->task) {
+ int ctxn;
+
+ for_each_task_context_nr(ctxn)
+ WARN_ON(ctx->task->perf_event_ctxp[ctxn] == ctx);
+
put_task_struct(ctx->task);
+ }
call_rcu(&ctx->rcu_head, free_ctx);
}
}
@@ -2658,9 +2667,6 @@ static void perf_pmu_sched_task(struct task_struct *prev,
static void perf_event_switch(struct task_struct *task,
struct task_struct *next_prev, bool sched_in);

-#define for_each_task_context_nr(ctxn) \
- for ((ctxn) = 0; (ctxn) < perf_nr_task_contexts; (ctxn)++)
-
/*
* Called from scheduler to remove the events of the current task,
* with interrupts disabled.
@@ -3154,15 +3160,16 @@ static int event_enable_on_exec(struct perf_event *event,
* Enable all of a task's events that have been marked enable-on-exec.
* This expects task == current.
*/
-static void perf_event_enable_on_exec(struct perf_event_context *ctx)
+static void perf_event_enable_on_exec(int ctxn)
{
- struct perf_event_context *clone_ctx = NULL;
+ struct perf_event_context *ctx, *clone_ctx = NULL;
struct perf_event *event;
unsigned long flags;
int enabled = 0;
int ret;

local_irq_save(flags);
+ ctx = current->perf_event_ctxp[ctxn];
if (!ctx || !ctx->nr_events)
goto out;

@@ -3205,17 +3212,11 @@ out:

void perf_event_exec(void)
{
- struct perf_event_context *ctx;
int ctxn;

rcu_read_lock();
- for_each_task_context_nr(ctxn) {
- ctx = current->perf_event_ctxp[ctxn];
- if (!ctx)
- continue;
-
- perf_event_enable_on_exec(ctx);
- }
+ for_each_task_context_nr(ctxn)
+ perf_event_enable_on_exec(ctxn);
rcu_read_unlock();
}

@@ -8967,6 +8968,8 @@ again:

mutex_unlock(&ctx->mutex);

+ task->perf_event_ctxp[ctxn] = NULL;
+
put_ctx(ctx);
}
}

Dmitry Vyukov

unread,
Jan 8, 2016, 5:28:26 AM1/8/16
to Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, LKML, syzkaller, Kostya Serebryany, Alexander Potapenko, Eric Dumazet, Sasha Levin
I've added the following change in an attempt to catch this WARNING:

@@ -2835,6 +2841,10 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
if (ctx->nr_events)
cpuctx->task_ctx = ctx;

+ mdelay(10);
+ if (cpuctx->task_ctx != NULL && cpuctx->task_ctx != ctx)
+ pr_err("HERE!!!");
+
perf_event_sched_in(cpuctx, cpuctx->task_ctx, task);

perf_pmu_enable(ctx->pmu);

Is this change legitimate?
Now my program that stresses perf hangs in an interesting state.
Processes consume some CPU:

  PID USER     PR NI VIRT RES SHR S %CPU %MEM    TIME+ COMMAND
 6671 nobody   20  0    0   0   0 Z 20.2  0.0 13:41.43 syz-executor
 6542 nobody   20  0    0   0   0 Z 17.6  0.0 13:25.74 syz-executor
 6558 nobody   20  0    0   0   0 Z 17.6  0.0 13:32.36 syz-executor

Some are blocked on futexes:

# cat /proc/6671/task/**/stack
[<ffffffff812f5bc4>] do_exit+0x1ab4/0x2b20
[<ffffffff812f6da8>] do_group_exit+0x108/0x320
[<ffffffff812f6fdd>] SyS_exit_group+0x1d/0x20
[<ffffffff85c8a2f6>] entry_SYSCALL_64_fastpath+0x16/0x7a
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8148464d>] futex_wait_queue_me+0x30d/0x5e0
[<ffffffff81485125>] futex_wait+0x215/0x5f0
[<ffffffff81489738>] do_futex+0x218/0x1680
[<ffffffff8148acd3>] SyS_futex+0x133/0x2b0
[<ffffffff85c8a2f6>] entry_SYSCALL_64_fastpath+0x16/0x7a
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8148464d>] futex_wait_queue_me+0x30d/0x5e0
[<ffffffff81485125>] futex_wait+0x215/0x5f0
[<ffffffff81489738>] do_futex+0x218/0x1680
[<ffffffff8148acd3>] SyS_futex+0x133/0x2b0
[<ffffffff85c8a2f6>] entry_SYSCALL_64_fastpath+0x16/0x7a
[<ffffffffffffffff>] 0xffffffffffffffff

And some hang in exit_to_usermode_loop:

# cat /proc/6542/task/**/stack
[<ffffffff812f5bc4>] do_exit+0x1ab4/0x2b20
[<ffffffff812f6da8>] do_group_exit+0x108/0x320
[<ffffffff812f6fdd>] SyS_exit_group+0x1d/0x20
[<ffffffff85c8a2f6>] entry_SYSCALL_64_fastpath+0x16/0x7a
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff810038e5>] exit_to_usermode_loop+0xd5/0x1a0
[<ffffffff8100631f>] syscall_return_slowpath+0x19f/0x210
[<ffffffff85c8a462>] int_ret_from_sys_call+0x25/0x9f
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8148464d>] futex_wait_queue_me+0x30d/0x5e0
[<ffffffff81485125>] futex_wait+0x215/0x5f0
[<ffffffff81489738>] do_futex+0x218/0x1680
[<ffffffff8148acd3>] SyS_futex+0x133/0x2b0
[<ffffffff85c8a2f6>] entry_SYSCALL_64_fastpath+0x16/0x7a
[<ffffffffffffffff>] 0xffffffffffffffff

They are unkillable, and one of the threads did do_group_exit...