
Q: sys_perf_event_open() && PF_EXITING


Oleg Nesterov

Nov 8, 2010, 10:10:01 AM

I am puzzled by PF_EXITING check in find_lively_task_by_vpid().

How can it help? The task can call do_exit() right after the check.

And why do we need it? The comment only says "Can't attach events to
a dying task". Maybe it tries to protect sys_perf_event_open() against
perf_event_exit_task_context(), but it can't.
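
To make the window concrete, a minimal sketch of the pattern in question
(attach_to_task() and install_context() are made-up names, not the real
functions):

    /*
     * Illustrative only: the PF_EXITING test and the later installation
     * are not serialized against the exit path, so the answer can become
     * stale immediately after the check.
     */
    static int attach_to_task(struct task_struct *task)
    {
            if (task->flags & PF_EXITING)           /* check ... */
                    return -ESRCH;

            /*
             * Nothing stops the task from entering do_exit() and running
             * perf_event_exit_task() right here.
             */

            return install_context(task);           /* ... act on a stale answer */
    }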

c93f7669 "perf_counter: Fix race in attaching counters to tasks and
exiting" says:

There is also a race between perf_counter_exit_task and
find_get_context; this solves the race by moving the get_ctx that
was in perf_counter_alloc into the locked region in find_get_context,
so that once find_get_context has got the context for a task, it
won't get freed even if the task calls perf_counter_exit_task.

OK, the code was changed since that commit, but afaics "it won't be
freed" is still true.

However,

It
doesn't matter if new top-level (non-inherited) counters get attached
to the context after perf_counter_exit_task has detached the context
from the task. They will just stay there and never get scheduled in
until the counters' fds get closed, and then perf_release will remove
them from the context and eventually free the context.

This looks wrong. perf_release() does free_event()->put_ctx(), and this
pairs with the get_ctx() after alloc_perf_context().

But __perf_event_init_context() sets ctx->refcount = 1, and I guess this
reference should be dropped by ctx->task ? If yes, then it is not OK to
attach the event after sys_perf_event_open().
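
For reference, the refcount accounting being described, as a sketch (not
taken verbatim from the code):

    /*
     *   __perf_event_init_context()        ctx->refcount = 1   (the task's reference)
     *   find_get_context(): get_ctx()      ctx->refcount = 2   (the event's reference)
     *
     *   perf_release() -> free_event() -> put_ctx()        drops the event's reference
     *   perf_event_exit_task_context() -> put_ctx()        must drop the task's reference
     *
     * If the context is attached after the task has already passed
     * perf_event_exit_task(), nobody ever drops the task-side reference.
     */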

No?


Hmm. jump_label_inc/dec looks obviously racy too. Say, free_event() races
with perf_event_alloc(). There is a window between atomic_xxx() and
jump_label_update(), afaics it is possible to call jump_label_disable()
when perf_task_events/perf_swevent_enabled != 0.
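
One bad interleaving, as a sketch (assuming jump_label_inc()/jump_label_dec()
roughly as they looked at the time, i.e. an atomic update followed by a
separate jump_label_update() call):

    /*
     *   CPU A: free_event()                    CPU B: perf_event_alloc()
     *   -------------------                    -------------------------
     *   atomic_dec_and_test(&key) -> true      (key: 1 -> 0)
     *                                          atomic_add_return(1, &key) == 1  (key: 0 -> 1)
     *                                          jump_label_update(ENABLE)
     *   jump_label_update(DISABLE)             label disabled although key == 1
     */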

Oleg.


Oleg Nesterov

Jan 19, 2011, 1:30:02 PM

On 11/08, Oleg Nesterov wrote:
>
> I am puzzled by PF_EXITING check in find_lively_task_by_vpid().
>
> How can it help? The task can call do_exit() right after the check.
>
> And why do we need it? The comment only says "Can't attach events to
> a dying task". Maybe it tries to protect sys_perf_event_open() against
> perf_event_exit_task_context(), but it can't.

Yes.

Please see 1/2. Well, I can't say I really like the idea to reuse
task->perf_event_mutex, but I do not see a better fix.

Also, I have no idea how I can actually test changes in code
I can hardly understand, so please review.

Also. I believe there are more problems in perf_install_event(), but
I need to recheck.

> Hmm. jump_label_inc/dec looks obviously racy too. Say, free_event() races
> with perf_event_alloc(). There is a window between atomic_xxx() and
> jump_label_update(), afaics it is possible to call jump_label_disable()
> when perf_task_events/perf_swevent_enabled != 0.

Another issue...

Oleg Nesterov

Jan 19, 2011, 1:40:02 PM

perf_event_init_task() should clear child->perf_event_ctxp[] before
anything else. Otherwise, if perf_event_init_context(perf_hw_context)
fails, perf_event_free_task() can free perf_event_ctxp[perf_sw_context]
copied from parent->perf_event_ctxp[] by dup_task_struct().

Also move the initialization of perf_event_mutex and perf_event_list
from perf_event_init_context() to perf_event_init_task().
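
The failure ordering, as a sketch (simplified; in reality dup_task_struct()
and the error-path free both happen inside copy_process()):

    child = dup_task_struct(parent);     /* child->perf_event_ctxp[] == parent's pointers */
    ...
    ret = perf_event_init_context(child, perf_hw_context);
    if (ret)                             /* perf_sw_context slot was never cleared,   */
            perf_event_free_task(child); /* so this frees the parent's context        */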

Signed-off-by: Oleg Nesterov <ol...@redhat.com>
---

kernel/perf_event.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)

--- git/kernel/perf_event.c~4_perf_event_init_task 2011-01-19 17:41:16.000000000 +0100
+++ git/kernel/perf_event.c 2011-01-19 18:49:23.000000000 +0100
@@ -6446,11 +6446,6 @@ int perf_event_init_context(struct task_
unsigned long flags;
int ret = 0;

- child->perf_event_ctxp[ctxn] = NULL;
-
- mutex_init(&child->perf_event_mutex);
- INIT_LIST_HEAD(&child->perf_event_list);
-
if (likely(!parent->perf_event_ctxp[ctxn]))
return 0;

@@ -6540,6 +6535,10 @@ int perf_event_init_task(struct task_str
{
int ctxn, ret;

+ memset(child->perf_event_ctxp, 0, sizeof(child->perf_event_ctxp));
+ mutex_init(&child->perf_event_mutex);
+ INIT_LIST_HEAD(&child->perf_event_list);
+
for_each_task_context_nr(ctxn) {
ret = perf_event_init_context(child, ctxn);
if (ret)

Oleg Nesterov

Jan 19, 2011, 1:40:01 PM

find_get_context() must not install the new perf_event_context if the
task has already passed perf_event_exit_task().

If nothing else, this means a memory leak. Initially ctx->refcount == 2,
it is supposed that perf_event_exit_task_context() should participate and
do the necessary put_ctx().

find_lively_task_by_vpid() checks PF_EXITING but this buys nothing, by the
time we call find_get_context() this task can be already dead. To the point,
cmpxchg() can succeed when the task has already done the last schedule().

Change find_get_context() to populate task->perf_event_ctxp[] under
task->perf_event_mutex, this way we can trust PF_EXITING because
perf_event_exit_task() takes the same mutex.
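
The ordering that makes PF_EXITING trustworthy here, as a sketch (simplified
from the exit path of that time):

    /*
     *   exiting task                              sys_perf_event_open()
     *   ------------                              ---------------------
     *   exit_signals()          sets PF_EXITING
     *   perf_event_exit_task()
     *     mutex_lock(&task->perf_event_mutex)
     *     ...
     *     mutex_unlock(...)
     *                                             mutex_lock(&task->perf_event_mutex)
     *                                             if the task has already passed
     *                                             perf_event_exit_task(), PF_EXITING
     *                                             must be visible here; otherwise
     *                                             perf_event_exit_task_context() has
     *                                             yet to run and will see the newly
     *                                             installed ctx
     *                                             mutex_unlock(...)
     */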

Also, change perf_event_exit_task_context() to use rcu_dereference().
Probably this is not strictly needed, but with or without this change
find_get_context() can race with setup_new_exec()->perf_event_exit_task(),
rcu_dereference() looks better.

Signed-off-by: Oleg Nesterov <ol...@redhat.com>
---

kernel/perf_event.c | 34 ++++++++++++++++++++--------------
1 file changed, 20 insertions(+), 14 deletions(-)

--- git/kernel/perf_event.c~3_find_get_context_vs_exit 2011-01-18 16:57:08.000000000 +0100
+++ git/kernel/perf_event.c 2011-01-19 17:41:16.000000000 +0100
@@ -2201,13 +2201,6 @@ find_lively_task_by_vpid(pid_t vpid)
if (!task)
return ERR_PTR(-ESRCH);

- /*
- * Can't attach events to a dying task.
- */
- err = -ESRCH;
- if (task->flags & PF_EXITING)
- goto errout;
-
/* Reuse ptrace permission checks for now. */
err = -EACCES;
if (!ptrace_may_access(task, PTRACE_MODE_READ))
@@ -2268,14 +2261,27 @@ retry:

get_ctx(ctx);

- if (cmpxchg(&task->perf_event_ctxp[ctxn], NULL, ctx)) {
- /*
- * We raced with some other task; use
- * the context they set.
- */
+ err = 0;
+ mutex_lock(&task->perf_event_mutex);
+ /*
+ * If it has already passed perf_event_exit_task().
+ * we must see PF_EXITING, it takes this mutex too.
+ */
+ if (task->flags & PF_EXITING)
+ err = -ESRCH;
+ else if (task->perf_event_ctxp[ctxn])
+ err = -EAGAIN;
+ else
+ rcu_assign_pointer(task->perf_event_ctxp[ctxn], ctx);
+ mutex_unlock(&task->perf_event_mutex);
+
+ if (unlikely(err)) {
put_task_struct(task);
kfree(ctx);
- goto retry;
+
+ if (err == -EAGAIN)
+ goto retry;
+ goto errout;
}
}

@@ -6127,7 +6133,7 @@ static void perf_event_exit_task_context
* scheduled, so we are now safe from rescheduling changing
* our context.
*/
- child_ctx = child->perf_event_ctxp[ctxn];
+ child_ctx = rcu_dereference(child->perf_event_ctxp[ctxn]);
task_ctx_sched_out(child_ctx, EVENT_ALL);

/*

Peter Zijlstra

Jan 19, 2011, 1:50:01 PM

On Wed, 2011-01-19 at 19:22 +0100, Oleg Nesterov wrote:
> find_get_context() must not install the new perf_event_context if the
> task has already passed perf_event_exit_task().
>
> If nothing else, this means a memory leak. Initially ctx->refcount == 2,
> it is supposed that perf_event_exit_task_context() should participate and
> do the necessary put_ctx().
>
> find_lively_task_by_vpid() checks PF_EXITING but this buys nothing, by the
> time we call find_get_context() this task can be already dead. To the point,
> cmpxchg() can succeed when the task has already done the last schedule().
>
> Change find_get_context() to populate task->perf_event_ctxp[] under
> task->perf_event_mutex, this way we can trust PF_EXITING because
> perf_event_exit_task() takes the same mutex.
>
> Also, change perf_event_exit_task_context() to use rcu_dereference().
> Probably this is not strictly needed, but with or without this change
> find_get_context() can race with setup_new_exec()->perf_event_exit_task(),
> rcu_dereference() looks better.

I think initially the idea was that this race couldn't happen because by
that time we would be unhashed from the pidhash and thus invisible for
new events, however from what I can make from the exit path we get
unhashed in exit_notify() which is _after_ perf_event_exit_task(), so
yes this looks to be a proper fix.
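
For reference, the relevant ordering in do_exit() (heavily abridged; only
the calls mentioned here, in the order they appear):

    void do_exit(long code)
    {
            ...
            exit_signals(tsk);              /* sets PF_EXITING */
            ...
            perf_event_exit_task(tsk);      /* tears down the perf contexts */
            ...
            exit_notify(tsk, group_dead);   /* only here does the task get unhashed */
            ...
    }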

Acked-by: Peter Zijlstra <a.p.zi...@chello.nl>

Peter Zijlstra

Jan 19, 2011, 2:00:02 PM

On Wed, 2011-01-19 at 19:22 +0100, Oleg Nesterov wrote:
> perf_event_init_task() should clear child->perf_event_ctxp[] before
> anything else. Otherwise, if perf_event_init_context(perf_hw_context)
> fails, perf_event_free_task() can free perf_event_ctxp[perf_sw_context]
> copied from parent->perf_event_ctxp[] by dup_task_struct().
>
> Also move the initialization of perf_event_mutex and perf_event_list
> from perf_event_init_context() to perf_event_init_task().
>
> Signed-off-by: Oleg Nesterov <ol...@redhat.com>

Another fine find.

Acked-by: Peter Zijlstra <a.p.zi...@chello.nl>

tip-bot for Oleg Nesterov

Jan 19, 2011, 2:30:01 PM

Commit-ID: 8550d7cb6ed6c89add49c3b6ad4c753ab8a3d7f9
Gitweb: http://git.kernel.org/tip/8550d7cb6ed6c89add49c3b6ad4c753ab8a3d7f9
Author: Oleg Nesterov <ol...@redhat.com>
AuthorDate: Wed, 19 Jan 2011 19:22:28 +0100
Committer: Ingo Molnar <mi...@elte.hu>
CommitDate: Wed, 19 Jan 2011 20:04:28 +0100

perf: Fix perf_event_init_task()/perf_event_free_task() interaction

perf_event_init_task() should clear child->perf_event_ctxp[]
before anything else. Otherwise, if
perf_event_init_context(perf_hw_context) fails,
perf_event_free_task() can free perf_event_ctxp[perf_sw_context]
copied from parent->perf_event_ctxp[] by dup_task_struct().

Also move the initialization of perf_event_mutex and
perf_event_list from perf_event_init_context() to
perf_event_init_task().

Signed-off-by: Oleg Nesterov <ol...@redhat.com>
Acked-by: Peter Zijlstra <a.p.zi...@chello.nl>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Arnaldo Carvalho de Melo <ac...@redhat.com>
Cc: Frederic Weisbecker <fwei...@gmail.com>
Cc: Paul Mackerras <pau...@samba.org>
Cc: Prasad <pra...@linux.vnet.ibm.com>
Cc: Roland McGrath <rol...@redhat.com>
LKML-Reference: <20110119182...@redhat.com>
Signed-off-by: Ingo Molnar <mi...@elte.hu>
---
kernel/perf_event.c | 9 ++++-----
1 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 4ec55ef..244ca3a 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -6446,11 +6446,6 @@ int perf_event_init_context(struct task_struct *child, int ctxn)
unsigned long flags;
int ret = 0;

- child->perf_event_ctxp[ctxn] = NULL;
-
- mutex_init(&child->perf_event_mutex);
- INIT_LIST_HEAD(&child->perf_event_list);
-
if (likely(!parent->perf_event_ctxp[ctxn]))
return 0;

@@ -6539,6 +6534,10 @@ int perf_event_init_task(struct task_struct *child)

tip-bot for Oleg Nesterov

Jan 19, 2011, 2:30:02 PM

Commit-ID: dbe08d82ce3967ccdf459f7951d02589cf967300
Gitweb: http://git.kernel.org/tip/dbe08d82ce3967ccdf459f7951d02589cf967300
Author: Oleg Nesterov <ol...@redhat.com>
AuthorDate: Wed, 19 Jan 2011 19:22:07 +0100
Committer: Ingo Molnar <mi...@elte.hu>
CommitDate: Wed, 19 Jan 2011 20:04:27 +0100

perf: Fix find_get_context() vs perf_event_exit_task() race

find_get_context() must not install the new perf_event_context
if the task has already passed perf_event_exit_task().

If nothing else, this means a memory leak. Initially
ctx->refcount == 2, it is supposed that
perf_event_exit_task_context() should participate and do the
necessary put_ctx().

find_lively_task_by_vpid() checks PF_EXITING but this buys
nothing, by the time we call find_get_context() this task can be
already dead. To the point, cmpxchg() can succeed when the task
has already done the last schedule().

Change find_get_context() to populate task->perf_event_ctxp[]
under task->perf_event_mutex, this way we can trust PF_EXITING
because perf_event_exit_task() takes the same mutex.

Also, change perf_event_exit_task_context() to use
rcu_dereference(). Probably this is not strictly needed, but
with or without this change find_get_context() can race with
setup_new_exec()->perf_event_exit_task(), rcu_dereference()
looks better.

Signed-off-by: Oleg Nesterov <ol...@redhat.com>


Acked-by: Peter Zijlstra <a.p.zi...@chello.nl>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Arnaldo Carvalho de Melo <ac...@redhat.com>
Cc: Frederic Weisbecker <fwei...@gmail.com>
Cc: Paul Mackerras <pau...@samba.org>
Cc: Prasad <pra...@linux.vnet.ibm.com>
Cc: Roland McGrath <rol...@redhat.com>
LKML-Reference: <20110119182...@redhat.com>
Signed-off-by: Ingo Molnar <mi...@elte.hu>

---
kernel/perf_event.c | 34 ++++++++++++++++++++--------------
1 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 84522c7..4ec55ef 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c

@@ -6127,7 +6133,7 @@ static void perf_event_exit_task_context(struct task_struct *child, int ctxn)

Oleg Nesterov

Jan 20, 2011, 2:40:01 PM

On 01/19, Oleg Nesterov wrote:
>
> Also. I believe there are more problems in perf_install_event(), but
> I need to recheck.

Help! I can't believe it can be so trivially wrong, but otoh I can't
understand how this can be correct.

So, ignoring details and !task case, __perf_install_in_context() does:

if (cpuctx->task_ctx || ctx->task != current)
return;

cpuctx->task_ctx = ctx;
event_sched_in(event);

Stupid question, what if this task has already passed
perf_event_exit_task() and thus it doesn't have ->perf_event_ctxp[] ?
Given that perf_event_context_sched_out() does nothing if !ctx, who
will event_sched_out() this event?

OK, even if I am right this is trivial, we just need the additional
check.
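
Something like this, as a sketch (task_has_perf_ctx() is a made-up helper):

    if (ctx->task && cpuctx->task_ctx != ctx) {
            if (cpuctx->task_ctx || ctx->task != current)
                    return;
            /*
             * Hypothetical extra check: don't install a task_ctx for a task
             * that has already lost its ->perf_event_ctxp[] in
             * perf_event_exit_task(); nothing would ever schedule the
             * event out again.
             */
            if (!task_has_perf_ctx(ctx->task))
                    return;
            cpuctx->task_ctx = ctx;
    }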

But, it seems, there is another problem. Forget about the exiting,
I can't understand why we can trust current in the code above.
With __ARCH_WANT_INTERRUPTS_ON_CTXSW schedule() does:

// sets cpuctx->task_ctx = NULL
perf_event_task_sched_out();

// enables irqs
prepare_lock_switch();


// updates current_task
switch_to();

What if IPI comes in the window before switch_to() ?

(the same questions for __perf_event_enable).

Peter Zijlstra

Jan 21, 2011, 7:20:01 AM

On Thu, 2011-01-20 at 20:30 +0100, Oleg Nesterov wrote:
> On 01/19, Oleg Nesterov wrote:
> >
> > Also. I believe there are more problems in perf_install_event(), but
> > I need to recheck.
>
> Help! I can't believe it can be so trivially wrong, but otoh I can't
> understand how this can be correct.
>
> So, ignoring details and !task case, __perf_install_in_context() does:
>
> if (cpuctx->task_ctx || ctx->task != current)
> return;
>
> cpuctx->task_ctx = ctx;
> event_sched_in(event);
>
> Stupid question, what if this task has already passed
> perf_event_exit_task() and thus it doesn't have ->perf_event_ctxp[] ?
> Given that perf_event_context_sched_out() does nothing if !ctx, who
> will event_sched_out() this event?
>
> OK, even if I am right this is trivial, we just need the additional
> check.

Indeed (or do the cleanup from put_ctx(), but that's too complex a
change I think).

> But, it seems, there is another problem. Forget about the exiting,
> I can't understand why we can trust current in the code above.
> With __ARCH_WANT_INTERRUPTS_ON_CTXSW schedule() does:
>
> // sets cpuctx->task_ctx = NULL
> perf_event_task_sched_out();
>
> // enables irqs
> prepare_lock_switch();
>
>
> // updates current_task
> switch_to();
>
> What if IPI comes in the window before switch_to() ?
>
> (the same questions for __perf_event_enable).

Ingo, do you have any insights in that, I think you wrote all that
initially?

Ingo Molnar

Jan 21, 2011, 8:10:01 AM


Not sure. Can an IPI come there - we have irqs disabled usually, don't we?

Thanks,

Ingo

Peter Zijlstra

Jan 21, 2011, 8:40:02 AM

On Fri, 2011-01-21 at 14:03 +0100, Ingo Molnar wrote:
> > > But, it seems, there is another problem. Forget about the exiting,
> > > I can't understand why we can trust current in the code above.
> > > With __ARCH_WANT_INTERRUPTS_ON_CTXSW schedule() does:
> > >
> > > // sets cpuctx->task_ctx = NULL
> > > perf_event_task_sched_out();
> > >
> > > // enables irqs
> > > prepare_lock_switch();
> > >
> > >
> > > // updates current_task
> > > switch_to();
> > >
> > > What if IPI comes in the window before switch_to() ?
> > >
> > > (the same questions for __perf_event_enable).
> >
> > Ingo, do you have any insights in that, I think you wrote all that
> > initially?
>
> Not sure. Can an IPI come there - we have irqs disabled usually, don't we?

Ah, I think I see how that works:

__perf_event_task_sched_out()
perf_event_context_sched_out()
if (do_switch)
cpuctx->task_ctx = NULL;

vs

__perf_install_in_context()
if (cpu_ctx->task_ctx != ctx)

Oleg Nesterov

Jan 21, 2011, 9:40:03 AM

On 01/21, Peter Zijlstra wrote:
>
> On Fri, 2011-01-21 at 14:03 +0100, Ingo Molnar wrote:
> > > > But, it seems, there is another problem. Forget about the exiting,
> > > > I can't understand why we can trust current in the code above.
> > > > With __ARCH_WANT_INTERRUPTS_ON_CTXSW schedule() does:
> > > >
> > > > // sets cpuctx->task_ctx = NULL
> > > > perf_event_task_sched_out();
> > > >
> > > > // enables irqs
> > > > prepare_lock_switch();
> > > >
> > > >
> > > > // updates current_task
> > > > switch_to();
> > > >
> > > > What if IPI comes in the window before switch_to() ?
> > > >
> > > > (the same questions for __perf_event_enable).
> > >
> > > Ingo, do you have any insights in that, I think you wrote all that
> > > initially?
> >
> > Not sure. Can an IPI come there - we have irqs disabled usually, don't we?

__ARCH_WANT_INTERRUPTS_ON_CTXSW enables irqs during prepare_task_switch()

> Ah, I think I see how that works:

Hmm. I don't...

>
> __perf_event_task_sched_out()
> perf_event_context_sched_out()
> if (do_switch)
> cpuctx->task_ctx = NULL;

exactly, this clears ->task_ctx

> vs
>
> __perf_install_in_context()
> if (cpu_ctx->task_ctx != ctx)

And then __perf_install_in_context() sets cpuctx->task_ctx = ctx,
because ctx->task == current && cpuctx->task_ctx == NULL.
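
Putting the two halves together, the interleaving in question (sketch):

    /*
     *   CPU doing schedule(prev -> next)      IPI from perf_install_in_context()
     *   --------------------------------      ----------------------------------
     *   perf_event_task_sched_out()
     *     cpuctx->task_ctx = NULL
     *   prepare_lock_switch()    irqs on
     *                                         __perf_install_in_context()
     *                                           ctx->task == current   (still prev)
     *                                           cpuctx->task_ctx == NULL
     *                                           cpuctx->task_ctx = ctx
     *                                           event_sched_in(event)
     *   switch_to(prev, next)
     *
     * prev's ctx gets installed on a CPU that prev is just leaving.
     */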

Oleg.

Peter Zijlstra

Jan 21, 2011, 10:10:02 AM

On Fri, 2011-01-21 at 15:26 +0100, Oleg Nesterov wrote:
>
> > Ah, I think I see how that works:
>
> Hmm. I don't...
>
> >
> > __perf_event_task_sched_out()
> > perf_event_context_sched_out()
> > if (do_switch)
> > cpuctx->task_ctx = NULL;
>
> exactly, this clears ->task_ctx
>
> > vs
> >
> > __perf_install_in_context()
> > if (cpu_ctx->task_ctx != ctx)
>
> And then __perf_install_in_context() sets cpuctx->task_ctx = ctx,
> because ctx->task == current && cpuctx->task_ctx == NULL.

Hrm, right, so the comment suggests it should do what it doesn't :-)

It looks like Paul's a63eaf34ae60bd (perf_counter: Dynamically allocate
tasks' perf_counter_context struct), relevant hunk below, wrecked it:

@@ -568,11 +582,17 @@ static void __perf_install_in_context(void *info)
* If this is a task context, we need to check whether it is
* the current task context of this cpu. If not it has been
* scheduled out before the smp call arrived.
+ * Or possibly this is the right context but it isn't
+ * on this cpu because it had no counters.
*/
- if (ctx->task && cpuctx->task_ctx != ctx)
- return;
+ if (ctx->task && cpuctx->task_ctx != ctx) {
+ if (cpuctx->task_ctx || ctx->task != current)
+ return;
+ cpuctx->task_ctx = ctx;
+ }

spin_lock_irqsave(&ctx->lock, flags);
+ ctx->is_active = 1;
update_context_time(ctx);

/*


I can't really seem to come up with a sane test that isn't racy with
something, my cold seems to have clogged not only my nose :/

Ingo Molnar

Jan 21, 2011, 10:40:02 AM


hm, this one's causing:

[ 25.557579] ===================================================
[ 25.561361] [ INFO: suspicious rcu_dereference_check() usage. ]
[ 25.561361] ---------------------------------------------------
[ 25.561361] kernel/perf_event.c:6136 invoked rcu_dereference_check() without protection!
[ 25.561361]
[ 25.561361] other info that might help us debug this:
[ 25.561361]
[ 25.561361]
[ 25.561361] rcu_scheduler_active = 1, debug_locks = 0
[ 25.561361] no locks held by true/1397.
[ 25.561361]
[ 25.561361] stack backtrace:
[ 25.561361] Pid: 1397, comm: true Not tainted 2.6.38-rc1-tip+ #86752
[ 25.561361] Call Trace:
[ 25.561361] [<ffffffff8106cd98>] ? lockdep_rcu_dereference+0xaa/0xb3
[ 25.561361] [<ffffffff810b34ee>] ? perf_event_exit_task+0x118/0x22a
[ 25.561361] [<ffffffff811133b8>] ? free_fs_struct+0x44/0x48
[ 25.561361] [<ffffffff810434ef>] ? do_exit+0x2c8/0x770
[ 25.561361] [<ffffffff813a52ed>] ? retint_swapgs+0xe/0x13
[ 25.561361] [<ffffffff81043c3c>] ? do_group_exit+0x82/0xad
[ 25.561361] [<ffffffff81043c7e>] ? sys_exit_group+0x17/0x1b
[ 25.561361] [<ffffffff81002acb>] ? system_call_fastpath+0x16/0x1b

Any ideas?

Thanks,

Ingo

Oleg Nesterov

Jan 21, 2011, 11:10:01 AM

On 01/21, Ingo Molnar wrote:
>
> * tip-bot for Oleg Nesterov <ol...@redhat.com> wrote:
>
> > @@ -6127,7 +6133,7 @@ static void perf_event_exit_task_context(struct task_struct *child, int ctxn)
> > * scheduled, so we are now safe from rescheduling changing
> > * our context.
> > */
> > - child_ctx = child->perf_event_ctxp[ctxn];
> > + child_ctx = rcu_dereference(child->perf_event_ctxp[ctxn]);
> > task_ctx_sched_out(child_ctx, EVENT_ALL);
> >
> > /*
>
> hm, this one's causing:
>
> [ 25.557579] ===================================================
> [ 25.561361] [ INFO: suspicious rcu_dereference_check() usage. ]


Oh, indeed, I am stupid!

I added rcu_dereference() because it has smp_read_barrier_depends(),
but I forgot about rcu_dereference_check().

I'll send the fix soon...

Oleg.

Oleg Nesterov

Jan 21, 2011, 1:10:04 PM

On 01/21, Oleg Nesterov wrote:
>
> In theory, almost every user of task->child->perf_event_ctxp[]
> is wrong. find_get_context() can install the new context at any
> moment, we need read_barrier_depends().

And perhaps it makes sense to fix them all, although the problem
is only theoretical.

> dbe08d82ce3967ccdf459f7951d02589cf967300 "perf: Fix
> find_get_context() vs perf_event_exit_task() race" added
> rcu_dereference() into perf_event_exit_task_context() to make
> the precedent, but this makes __rcu_dereference_check() unhappy.
> Use rcu_dereference_raw() to shut up the warning.

But rcu_dereference_raw() looks a bit confusing, and it is not
very convenient to use read_barrier_depends() directly.

Paul, maybe it makes sense to add the new trivial helper which
can be used instead?

Yes, this is only cosmetic issue, I know ;)

Frederic Weisbecker

Jan 21, 2011, 3:50:01 PM



What do you think about the following (only compile tested yet), it
probably needs more comments, factorizing the checks betwee, perf_event_enable()
and perf_install_in_context(), build-cond against __ARCH_WANT_INTERRUPTS_ON_CTXSW,
but the (good or bad) idea is there.


diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index c5fa717..e97472b 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -928,6 +928,8 @@ static void add_event_to_ctx(struct perf_event *event,
event->tstamp_stopped = tstamp;
}

+static DEFINE_PER_CPU(int, task_events_schedulable);
+
/*
* Cross CPU call to install and enable a performance event
*
@@ -949,7 +951,8 @@ static void __perf_install_in_context(void *info)
* on this cpu because it had no events.
*/
if (ctx->task && cpuctx->task_ctx != ctx) {
- if (cpuctx->task_ctx || ctx->task != current)
+ if (cpuctx->task_ctx || ctx->task != current
+ || !__get_cpu_var(task_events_schedulable))
return;
cpuctx->task_ctx = ctx;
}

@@ -1091,7 +1094,8 @@ static void __perf_event_enable(void *info)
* event's task is the current task on this cpu.
*/
if (ctx->task && cpuctx->task_ctx != ctx) {
- if (cpuctx->task_ctx || ctx->task != current)
+ if (cpuctx->task_ctx || ctx->task != current
+ || !__get_cpu_var(task_events_schedulable))
return;
cpuctx->task_ctx = ctx;
}

@@ -1414,6 +1418,9 @@ void __perf_event_task_sched_out(struct task_struct *task,
{
int ctxn;

+ __get_cpu_var(task_events_schedulable) = 0;
+ barrier(); /* Must be visible by enable/install_in_context IPI */
+
for_each_task_context_nr(ctxn)
perf_event_context_sched_out(task, ctxn, next);
}
@@ -1587,6 +1594,8 @@ void __perf_event_task_sched_in(struct task_struct *task)
struct perf_event_context *ctx;
int ctxn;

+ __get_cpu_var(task_events_schedulable) = 1;
+
for_each_task_context_nr(ctxn) {
ctx = task->perf_event_ctxp[ctxn];
if (likely(!ctx))
@@ -5964,6 +5973,18 @@ SYSCALL_DEFINE5(perf_event_open,
WARN_ON_ONCE(ctx->parent_ctx);
mutex_lock(&ctx->mutex);

+ /*
+ * Every pending sched switch must finish so that
+ * we ensure every pending calls to perf_event_task_sched_in/out are
+ * finished. We ensure the next ones will correctly handle the
+ * perf_task_events label and then the task_events_schedulable
+ * state. So perf_install_in_context() won't install events
+ * in the tiny race window between perf_event_task_sched_out()
+ * and perf_event_task_sched_in() in the __ARCH_WANT_INTERRUPTS_ON_CTXSW
+ * case.
+ */
+ synchronize_sched();
+
if (move_group) {
perf_install_in_context(ctx, group_leader, cpu);
get_ctx(ctx);

Paul E. McKenney

Jan 21, 2011, 5:00:01 PM

On Fri, Jan 21, 2011 at 06:53:45PM +0100, Oleg Nesterov wrote:
> On 01/21, Oleg Nesterov wrote:
> >
> > In theory, almost every user of task->child->perf_event_ctxp[]
> > is wrong. find_get_context() can install the new context at any
> > moment, we need read_barrier_depends().
>
> And perhaps it makes sense to fix them all, although the problem
> is only theoretical.
>
> > dbe08d82ce3967ccdf459f7951d02589cf967300 "perf: Fix
> > find_get_context() vs perf_event_exit_task() race" added
> > rcu_dereference() into perf_event_exit_task_context() to make
> > the precedent, but this makes __rcu_dereference_check() unhappy.
> > Use rcu_dereference_raw() to shut up the warning.
>
> But rcu_dereference_raw() looks a bit confusing, and it is not
> very convenient to use read_barrier_depends() directly.
>
> > Paul, maybe it makes sense to add the new trivial helper which
> can be used instead?
>
> Yes, this is only cosmetic issue, I know ;)

Cosmetic issues can be pretty important to the poor guy trying to read
the code. ;-)

What keeps the structure that rcu_dereference_raw() returns a pointer
to from going away? Best would be if a lockdep condition could be
constructed from the answer to this question and added to the appropriate
rcu_dereference() primitive.

Thanx, Paul

tip-bot for Oleg Nesterov

Jan 21, 2011, 5:20:02 PM

Commit-ID: 806839b22cbda90176d7f8d421889bddd7826e93
Gitweb: http://git.kernel.org/tip/806839b22cbda90176d7f8d421889bddd7826e93
Author: Oleg Nesterov <ol...@redhat.com>
AuthorDate: Fri, 21 Jan 2011 18:45:47 +0100
Committer: Ingo Molnar <mi...@elte.hu>
CommitDate: Fri, 21 Jan 2011 22:08:16 +0100

perf: perf_event_exit_task_context: s/rcu_dereference/rcu_dereference_raw/

In theory, almost every user of task->child->perf_event_ctxp[]
is wrong. find_get_context() can install the new context at any
moment, we need read_barrier_depends().

dbe08d82ce3967ccdf459f7951d02589cf967300 "perf: Fix


find_get_context() vs perf_event_exit_task() race" added
rcu_dereference() into perf_event_exit_task_context() to make
the precedent, but this makes __rcu_dereference_check() unhappy.
Use rcu_dereference_raw() to shut up the warning.

Reported-by: Ingo Molnar <mi...@elte.hu>
Signed-off-by: Oleg Nesterov <ol...@redhat.com>
Cc: ac...@redhat.com
Cc: pau...@samba.org
Cc: st...@rowland.harvard.edu
Cc: a.p.zi...@chello.nl
Cc: fwei...@gmail.com
Cc: rol...@redhat.com
Cc: pra...@linux.vnet.ibm.com
Cc: Paul E. McKenney <pau...@linux.vnet.ibm.com>
LKML-Reference: <2011012117...@redhat.com>


Signed-off-by: Ingo Molnar <mi...@elte.hu>
---

kernel/perf_event.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index c5fa717..126a302 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -6136,7 +6136,7 @@ static void perf_event_exit_task_context(struct task_struct *child, int ctxn)
* scheduled, so we are now safe from rescheduling changing
* our context.
*/
- child_ctx = rcu_dereference(child->perf_event_ctxp[ctxn]);
+ child_ctx = rcu_dereference_raw(child->perf_event_ctxp[ctxn]);
task_ctx_sched_out(child_ctx, EVENT_ALL);

/*

Oleg Nesterov

Jan 24, 2011, 7:00:02 AM

On 01/21, Frederic Weisbecker wrote:
>
> +static DEFINE_PER_CPU(int, task_events_schedulable);

Yes, I think this can work. I thought about this too. The only problem,
this doesn't make the whole code more understandable ;)

> @@ -1587,6 +1594,8 @@ void __perf_event_task_sched_in(struct task_struct *task)
> struct perf_event_context *ctx;
> int ctxn;
>
> + __get_cpu_var(task_events_schedulable) = 1;
> +
> for_each_task_context_nr(ctxn) {
> ctx = task->perf_event_ctxp[ctxn];
> if (likely(!ctx))

This doesn't look right. We should set task_events_schedulable
_after_ perf_event_context_sched_in(), otherwise we have a similar
race with next.

rq->curr and current_task were already updated. __perf_install_in_context
should not set "cpuctx->task_ctx = next" before perf_event_context_sched_in(),
it does nothing if cpuctx->task_ctx == ctx.

OTOH, if we set task_events_schedulable after for_each_task_context_nr(),
then we have another race with next, but this race is minor. If
find_get_context() + perf_install_in_context() happen in this window,
the new event won't be scheduled until next reschedules itself.
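
IOW, something like this ordering (sketch, hypothetical placement):

    void __perf_event_task_sched_in(struct task_struct *task)
    {
            struct perf_event_context *ctx;
            int ctxn;

            for_each_task_context_nr(ctxn) {
                    ctx = task->perf_event_ctxp[ctxn];
                    if (likely(!ctx))
                            continue;
                    perf_event_context_sched_in(ctx);
            }

            /* only now may an IPI install into this task's context */
            __get_cpu_var(task_events_schedulable) = 1;
    }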

> + /*
> + * Every pending sched switch must finish so that
> + * we ensure every pending calls to perf_event_task_sched_in/out are
> + * finished. We ensure the next ones will correctly handle the
> + * perf_task_events label and then the task_events_schedulable
> + * state. So perf_install_in_context() won't install events
> + * in the tiny race window between perf_event_task_sched_out()
> + * and perf_event_task_sched_in() in the __ARCH_WANT_INTERRUPTS_ON_CTXSW
> + * case.
> + */
> + synchronize_sched();

Yes, if perf_task_events was zero before perf_event_alloc(), then it
is possible that task_events_schedulable == 1 while schedule() is in
progress. perf_event_create_kernel_counter() needs this too.

Frederic, All, can't we simplify this?

First, we modify __perf_install_in_context() so that it never tries
to install the event into !is_active context. IOW, it never tries
to set cpuctx->task_ctx = ctx.

Then we add the new trivial helper stop_resched_task(task) which
simply wakes up the stop thread on task_cpu(task), and thus forces
this task to reschedule.

Now,

static void
perf_install_in_context(struct perf_event_context *ctx,
struct perf_event *event,
int cpu)
{
struct task_struct *task = ctx->task;

event->ctx = ctx;

if (!task) {
/*
* Per cpu events are installed via an smp call and
* the install is always successful.
*/
smp_call_function_single(cpu, __perf_install_in_context,
event, 1);
return;
}

for (;;) {
bool done, need_resched = false;

raw_spin_lock_irq(&ctx->lock);
done = !list_empty(&event->group_entry);
if (!done && !ctx->is_active) {
add_event_to_ctx(event, ctx);
need_resched = task_running(task);
done = true;
}
raw_spin_unlock_irq(&ctx->lock);

if (done) {
if (need_resched)
stop_resched_task(task);
break;
}

task_oncpu_function_call(task, __perf_install_in_context,
event);
}
}

Yes, stop_resched_task() can't help if this task itself is the stop thread.
But the stop thread shouldn't run for a long time without rescheduling,
otherwise we already have the problems.

Do you all think this makes any sense?

Oleg.

Oleg Nesterov

Jan 24, 2011, 7:10:01 AM

On 01/21, Paul E. McKenney wrote:
>
> On Fri, Jan 21, 2011 at 06:53:45PM +0100, Oleg Nesterov wrote:
> >
> > But rcu_dereference_raw() looks a bit confusing, and it is not
> > very convenient to use read_barrier_depends() directly.
> >
> > Paul, maybe it makes sense to add the new trivial helper which
> > can be used instead?
> >
> > Yes, this is only cosmetic issue, I know ;)
>
> Cosmetic issues can be pretty important to the poor guy trying to read
> the code. ;-)

Agreed!

> What keeps the structure that rcu_dereference_raw() returns a pointer
> to from going away?

It can't go away, current owns its ->perf_event_ctxp[] pointers. But
the pointer can be installed at any time by sys_perf_event_open().

Currently the code does

ctx = current->perf_event_ctxp[ctxn];
if (ctx)
do_something(ctx);

and in theory we need smp_read_barrier_depends() in between.
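
I.e., in theory the read should look something like:

    ctx = current->perf_event_ctxp[ctxn];
    smp_read_barrier_depends();     /* pairs with the installing cmpxchg() */
    if (ctx)
            do_something(ctx);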

> Best would be if a lockdep condition could be
> constructed from the answer to this question and added to the appropriate
> rcu_dereference() primitive.

In this case the condition is "true", so we can use rcu_dereference_raw().
The only problem, it looks confusing. Especially because you actually
need rcu_read_lock() if you look at not_current_task->perf_event_ctxp[].

Oleg.

Oleg Nesterov

Jan 26, 2011, 1:30:02 PM

On 01/24, Oleg Nesterov wrote:
>
> Frederic, All, can't we simplify this?

Well, to clarify, it looks simpler to me ;)

But if you don't like this approach, let's use the task_events_schedulable flag.

> First, we modify __perf_install_in_context() so that it never tries
> to install the event into !is_active context. IOW, it never tries
> to set cpuctx->task_ctx = ctx.
>
> Then we add the new trivial helper stop_resched_task(task) which
> simply wakes up the stop thread on task_cpu(task), and thus forces
> this task to reschedule.
>

> ...


>
> Yes, stop_resched_task() can't help if this task itself is the stop thread.
> But the stop thread shouldn't run for a long time without rescheduling,
> otherwise we already have the problems.

Please see the untested patch below. It doesn't change perf_event_enable(),
only perf_install_in_context(). Just for early review to know your opinion.
To simplify the reading, here is the code:

void task_force_schedule(struct task_struct *p)
{
struct rq *rq;

preempt_disable();
rq = task_rq(p);
if (rq->curr == p)
wake_up_process(rq->stop);
preempt_enable();
}

static void
perf_install_in_context(struct perf_event_context *ctx,
struct perf_event *event,
int cpu)
{
struct task_struct *task = ctx->task;

event->ctx = ctx;

if (!task) {
/*
* Per cpu events are installed via an smp call and
* the install is always successful.
*/
smp_call_function_single(cpu, __perf_install_in_context,
event, 1);
return;
}

for (;;) {
raw_spin_lock_irq(&ctx->lock);
/*
* The lock prevents that this context is
* scheduled in, we can add the event safely.
*/
if (!ctx->is_active)
add_event_to_ctx(event, ctx);
raw_spin_unlock_irq(&ctx->lock);

if (event->attach_state & PERF_ATTACH_CONTEXT) {
task_force_schedule(task);
break;
}

task_oncpu_function_call(task, __perf_install_in_context,
event);
if (event->attach_state & PERF_ATTACH_CONTEXT)
break;
}
}

Oleg.

include/linux/sched.h | 1 +
kernel/sched.c | 11 +++++++++++
kernel/perf_event.c | 49 +++++++++++++++++++++----------------------------
3 files changed, 33 insertions(+), 28 deletions(-)

--- perf/include/linux/sched.h~1_force_resched 2011-01-14 18:21:04.000000000 +0100
+++ perf/include/linux/sched.h 2011-01-26 17:54:28.000000000 +0100
@@ -2584,6 +2584,7 @@ static inline void inc_syscw(struct task
extern void task_oncpu_function_call(struct task_struct *p,
void (*func) (void *info), void *info);

+extern void task_force_schedule(struct task_struct *p);

#ifdef CONFIG_MM_OWNER
extern void mm_update_next_owner(struct mm_struct *mm);
--- perf/kernel/sched.c~1_force_resched 2011-01-20 20:37:11.000000000 +0100
+++ perf/kernel/sched.c 2011-01-26 17:52:42.000000000 +0100
@@ -1968,6 +1968,17 @@ void sched_set_stop_task(int cpu, struct
}
}

+void task_force_schedule(struct task_struct *p)
+{
+ struct rq *rq;
+
+ preempt_disable();
+ rq = task_rq(p);
+ if (rq->curr == p)
+ wake_up_process(rq->stop);
+ preempt_enable();
+}
+
/*
* __normal_prio - return the priority that is based on the static prio
*/
--- perf/kernel/perf_event.c~2_install_ctx_via_resched 2011-01-21 18:41:02.000000000 +0100
+++ perf/kernel/perf_event.c 2011-01-26 18:32:30.000000000 +0100
@@ -943,16 +943,10 @@ static void __perf_install_in_context(vo

/*
* If this is a task context, we need to check whether it is
- * the current task context of this cpu. If not it has been
- * scheduled out before the smp call arrived.
- * Or possibly this is the right context but it isn't
- * on this cpu because it had no events.
+ * the current task context of this cpu.
*/
- if (ctx->task && cpuctx->task_ctx != ctx) {
- if (cpuctx->task_ctx || ctx->task != current)
- return;
- cpuctx->task_ctx = ctx;
- }
+ if (ctx->task && cpuctx->task_ctx != ctx)
+ return;

raw_spin_lock(&ctx->lock);
ctx->is_active = 1;
@@ -1030,27 +1024,26 @@ perf_install_in_context(struct perf_even
return;
}

-retry:
- task_oncpu_function_call(task, __perf_install_in_context,
- event);
-
- raw_spin_lock_irq(&ctx->lock);
- /*
- * we need to retry the smp call.
- */
- if (ctx->is_active && list_empty(&event->group_entry)) {
+ for (;;) {
+ raw_spin_lock_irq(&ctx->lock);
+ /*
+ * The lock prevents that this context is
+ * scheduled in, we can add the event safely.
+ */
+ if (!ctx->is_active)
+ add_event_to_ctx(event, ctx);
raw_spin_unlock_irq(&ctx->lock);
- goto retry;
- }

- /*
- * The lock prevents that this context is scheduled in so we
- * can add the event safely, if it the call above did not
- * succeed.
- */
- if (list_empty(&event->group_entry))
- add_event_to_ctx(event, ctx);
- raw_spin_unlock_irq(&ctx->lock);
+ if (event->attach_state & PERF_ATTACH_CONTEXT) {
+ task_force_schedule(task);
+ break;
+ }
+
+ task_oncpu_function_call(task, __perf_install_in_context,
+ event);
+ if (event->attach_state & PERF_ATTACH_CONTEXT)
+ break;
+ }
}

/*

Oleg Nesterov

Jan 26, 2011, 2:00:01 PM

On 01/26, Oleg Nesterov wrote:
>
> Please see the untested patch below. It doesn't change perf_event_enable(),
> only perf_install_in_context().

Forgot to mention... Also, it doesn't try to fix the race with do_exit(),
this needs another change.

And, damn, can't resist. This is mostly cosmetic issue, but I feel
discomfort every time I look at task_oncpu_function_call(). It _looks_
obviously wrong, even if the problem doesn't exist in practice. I'll
send the pedantic fix to keep the maintainers busy ;)

Oleg.

Oleg Nesterov

Jan 26, 2011, 2:10:02 PM

kick_process() and task_oncpu_function_call() are not right, they
can use the dead CPU for smp_send_reschedule/smp_call_function_single
if try_to_wake_up() makes this task running after we read task_cpu().

Given that task_curr() is inline this problem is purely theoretical,
the compiler doesn't read task_cpu() twice, but the code looks wrong.

Signed-off-by: Oleg Nesterov <ol...@redhat.com>
---

kernel/sched.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

--- perf/kernel/sched.c~task_cpu_vs_task_curr 2011-01-26 19:26:40.000000000 +0100
+++ perf/kernel/sched.c 2011-01-26 19:26:58.000000000 +0100
@@ -2269,7 +2269,7 @@ void kick_process(struct task_struct *p)

preempt_disable();
cpu = task_cpu(p);
- if ((cpu != smp_processor_id()) && task_curr(p))
+ if ((cpu != smp_processor_id()) && (cpu_curr(cpu) == p))
smp_send_reschedule(cpu);
preempt_enable();
}
@@ -2292,7 +2292,7 @@ void task_oncpu_function_call(struct tas

preempt_disable();
cpu = task_cpu(p);
- if (task_curr(p))
+ if (cpu_curr(cpu) == p)
smp_call_function_single(cpu, func, info, 1);
preempt_enable();

Peter Zijlstra

Jan 26, 2011, 2:10:02 PM

On Wed, 2011-01-26 at 19:49 +0100, Oleg Nesterov wrote:
> On 01/26, Oleg Nesterov wrote:
> >
> > Please see the untested patch below. It doesn't change perf_event_enable(),
> > only perf_install_in_context().
>
> Forgot to mention... Also, it doesn't try to fix the race with do_exit(),
> this needs another change.
>
> And, damn, can't resist. This is mostly cosmetic issue, but I feel
> discomfort every time I look at task_oncpu_function_call(). It _looks_
> obviously wrong, even if the problem doesn't exist in practice. I'll
> send the pedantic fix to keep the maintainers busy ;)

I've been trying to sit down and work my way through it today, your last
suggestion very nearly seemed to make sense, but I kept getting
distracted.

FWIW I think perf_event_enable() has the very same issue...

Peter Zijlstra

Jan 26, 2011, 2:40:01 PM

On Wed, 2011-01-26 at 20:33 +0100, Peter Zijlstra wrote:

> Wouldn't something like the below cure things too?

> +struct task_function_call {
> + struct task_struct *p;
> + void (*func)(void *info);
> + void *info;
> +};
> +
> +void task_function_trampoline(void *data)
> +{
> + struct task_function_call *tfc = data;
> +
> + if (this_rq()->curr != tfc->p)
> + return;
> +
> + tfc->func(tfc->data);
> +}

tfc->info of course ;-)

Oleg Nesterov

Jan 26, 2011, 4:30:02 PM

On 01/26, Peter Zijlstra wrote:

>
> On Wed, 2011-01-26 at 20:05 +0100, Peter Zijlstra wrote:
> > On Wed, 2011-01-26 at 19:49 +0100, Oleg Nesterov wrote:
> > > On 01/26, Oleg Nesterov wrote:
> > > >
> > > > Please see the untested patch below. It doesn't change perf_event_enable(),
> > > > only perf_install_in_context().
> > >
> > > Forgot to mention... Also, it doesn't try to fix the race with do_exit(),
> > > this needs another change.
> > >
> > > And, damn, can't resist. This is mostly cosmetic issue, but I feel
> > > discomfort every time I look at task_oncpu_function_call(). It _looks_
> > > obviously wrong, even if the problem doesn't exist in practice. I'll
> > > send the pedantic fix to keep the maintainers busy ;)
> >
> > I've been trying to sit down and work my way through it today, your last
> > suggestion very nearly seemed to make sense, but I kept getting
> > distracted.
> >
> > FWIW I think perf_event_enable() has the very same issue...

Yes, yes, note the "doesn't change perf_event_enable()" above.

In fact, I _suspect_ perf_event_enable() has more problems, but
I need to recheck.

> +void task_function_trampoline(void *data)


> +{
> + struct task_function_call *tfc = data;
> +
> + if (this_rq()->curr != tfc->p)
> + return;

Yes, I was thinking about checking rq->curr too, but this doesn't
really help. This closes the race with "prev", but we have the similar
race with "next".

__perf_install_in_context() should not set ->task_ctx before next
does perf_event_context_sched_in(). Otherwise it will do nothing,
it checks cpuctx->task_ctx == ctx.

Oleg.

Oleg Nesterov

Jan 26, 2011, 4:50:02 PM

On 01/26, Oleg Nesterov wrote:
>
> > +void task_function_trampoline(void *data)
> > +{
> > + struct task_function_call *tfc = data;
> > +
> > + if (this_rq()->curr != tfc->p)
> > + return;
>
> Yes, I was thinking about checking rq->curr too, but this doesn't
> really help. This closes the race with "prev", but we have the similar
> race with "next".
>
> __perf_install_in_context() should not set ->task_ctx before next
> does perf_event_context_sched_in(). Otherwise it will do nothing,
> it checks cpuctx->task_ctx == ctx.

But of course, we can add rq->in_context_switch or something. This
is more or less equal to Frederic's per-cpu task_events_schedulable
but simpler, because this doesn't depend on perf_task_events.

This is what I had in mind initially but I didn't dare to add the
new member into rq, it is only needed for perf.

Peter Zijlstra

Jan 27, 2011, 5:40:02 AM

On Wed, 2011-01-26 at 22:33 +0100, Oleg Nesterov wrote:
>
> This is what I had in mind initially but I didn't dare to add the
> new member into rq, it is only needed for perf.

Right, but it's a weakness in the task_oncpu_function_call()
implementation, wouldn't any user run into this problem eventually?

Peter Zijlstra

Jan 27, 2011, 7:30:02 AM

On Thu, 2011-01-27 at 11:32 +0100, Peter Zijlstra wrote:
> On Wed, 2011-01-26 at 22:33 +0100, Oleg Nesterov wrote:
> >
> > This is what I had in mind initially but I didn't dare to add the
> > new member into rq, it is only needed for perf.
>
> Right, but it's a weakness in the task_oncpu_function_call()
> implementation, wouldn't any user run into this problem eventually?

I can't seem to avoid having to add this rq member, but like you said,
we only need to do that when __ARCH_WANT_INTERRUPTS_ON_CTXSW.

We still need to validate p is actually current when the IPI happens,
the test might return true in task_oncpu_function_call() but be false by
the time we process the IPI.

So this should avoid us calling @func when @p isn't (fully) running.

---
kernel/sched.c | 46 ++++++++++++++++++++++++++++++++++++++++------
1 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 18d38e4..fbff6a8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -490,7 +490,10 @@ struct rq {
struct task_struct *curr, *idle, *stop;
unsigned long next_balance;
struct mm_struct *prev_mm;
-
+#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
+ int in_ctxsw;
+#endif
+
u64 clock;
u64 clock_task;

@@ -2265,6 +2268,29 @@ void kick_process(struct task_struct *p)
EXPORT_SYMBOL_GPL(kick_process);
#endif /* CONFIG_SMP */

+struct task_function_call {
+ struct task_struct *p;
+ void (*func)(void *info);
+ void *info;
+};
+
+void task_function_trampoline(void *data)
+{
+ struct task_function_call *tfc = data;
+ struct task_struct *p = tfc->p;
+ struct rq *rq = this_rq();
+
+#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
+ if (rq->in_ctxsw)
+ return;
+#endif
+
+ if (rq->curr != p)
+ return;
+
+ tfc->func(tfc->info);
+}
+
/**
* task_oncpu_function_call - call a function on the cpu on which a task runs
* @p: the task to evaluate
@@ -2278,11 +2304,16 @@ void task_oncpu_function_call(struct task_struct *p,
void (*func) (void *info), void *info)
{
int cpu;
+ struct task_function_call data = {
+ .p = p,
+ .func = func,
+ .info = info,
+ };

preempt_disable();
cpu = task_cpu(p);
if (task_curr(p))
- smp_call_function_single(cpu, func, info, 1);
+ smp_call_function_single(cpu, task_function_trampoline, &data, 1);
preempt_enable();
}

@@ -2776,9 +2807,15 @@ static inline void
prepare_task_switch(struct rq *rq, struct task_struct *prev,
struct task_struct *next)
{
+#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
+ rq->in_ctxsw = 1;
+#endif
+ sched_info_switch(prev, next);
+ perf_event_task_sched_out(prev, next);
fire_sched_out_preempt_notifiers(prev, next);
prepare_lock_switch(rq, next);
prepare_arch_switch(next);
+ trace_sched_switch(prev, next);
}

/**
@@ -2823,6 +2860,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
perf_event_task_sched_in(current);
#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
local_irq_enable();
+ rq->in_ctxsw = 0;
#endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
finish_lock_switch(rq, prev);

@@ -2911,7 +2949,6 @@ context_switch(struct rq *rq, struct task_struct *prev,
struct mm_struct *mm, *oldmm;

prepare_task_switch(rq, prev, next);
- trace_sched_switch(prev, next);
mm = next->mm;
oldmm = prev->active_mm;
/*
@@ -3989,9 +4026,6 @@ need_resched_nonpreemptible:
rq->skip_clock_update = 0;

if (likely(prev != next)) {
- sched_info_switch(prev, next);
- perf_event_task_sched_out(prev, next);
-
rq->nr_switches++;
rq->curr = next;
++*switch_count;

Peter Zijlstra

Jan 27, 2011, 8:20:02 AM


Right, so the fact of introducing extra scheduling makes me feel
uncomfortable... the whole purpose is to observe without perturbing (as
much as possible).

So the whole crux of the matter is adding a ctx to a running process. If
the ctx exists, ->is_active will be tracked properly and much of the
problem goes away.

rcu_assign_pointer(task->perf_event_ctx[n], new_ctx);
task_oncpu_function_call(task, __perf_install_in_context, event);

That should, I think, suffice to get the ctx in sync with the task state; we've
got the following cases:
1) task is in the middle of scheduling in
2) task is in the middle of scheduling out
3) task is running

Without __ARCH_WANT_INTERRUPTS_ON_CTXSW everything is boring and works:
1: the IPI will be delayed until 3, 2: the IPI finds another task and
the next schedule-in will sort things.

With, however, things are more interesting. 2 seems to be adequately
covered by the patch I just sent, the IPI will bail and the next
sched-in of the relevant task will pick matters up. 1 otoh doesn't seem
covered, the IPI will bail, leaving us stranded.

To fix this it seems we need to make task_oncpu_function_call() wait
until the context switch is done, something like
while (cpu_rq(cpu)->in_ctxsw) cpu_relax(); before sending the IPI,
however that would require adding a few memory barriers I think...
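
Roughly this, as a sketch (hypothetical, and as said it still needs the
right barriers):

    preempt_disable();
    cpu = task_cpu(p);
    while (cpu_rq(cpu)->in_ctxsw)   /* wait for the context switch to finish */
            cpu_relax();
    smp_rmb();                      /* then look at the task state again */
    if (task_curr(p))
            smp_call_function_single(cpu, func, info, 1);
    preempt_enable();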

/me goes search for implied barriers around there.

Peter Zijlstra

Jan 27, 2011, 9:30:01 AM

On Thu, 2011-01-27 at 14:14 +0100, Peter Zijlstra wrote:
>
> With, however, things are more interesting. 2 seems to be adequately
> covered by the patch I just sent, the IPI will bail and the next
> sched-in of the relevant task will pick matters up. 1 otoh doesn't seem
> covered, the IPI will bail, leaving us stranded.

blergh, so the race condition specific to perf can be cured by putting
the ->in_ctxsw = 0 under the local_irq_disable(). When we hit early,
perf_event_task_sched_in() will do the job and we can simply bail in
the IPI. If we hit late, the IPI will be delayed until after, and we'll
be in case 3 again.

More generic task_oncpu_function_call() users, say ones using preempt
notifiers, will have to deal with the fact that the sched_in notifier
runs after we unlock/enable irqs.

<crazy idea here>

So I was contemplating if we could make things work by placing
rq->nr_switches++; _after_ context_switch() and use:

rq->curr != current
mb() /* implied by ctxsw? */
rq->nr_switches++

to do something like:

nr_switches = rq->nr_switches;
smp_rmb();
if (rq->curr != current) {
smp_rmb();
while (rq->nr_switches == nr_switches)
cpu_relax();
}

to synchronize things, but then my head hurt.. mostly because you can
only use rq->curr != current on the local cpu, in which case spinning
will deadlock you.

The 'solution' seemed to be to do that test from an IPI, and return the
state in struct task_function_call, then spin on the other cpu..

So I've likely fallen off a cliff somewhere along the line, but just in
case, here's the patch:
---

kernel/sched.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++-------
1 files changed, 49 insertions(+), 8 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 18d38e4..31f8d75 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2265,6 +2265,30 @@ void kick_process(struct task_struct *p)


EXPORT_SYMBOL_GPL(kick_process);
#endif /* CONFIG_SMP */

+struct task_function_call {
+ struct task_struct *p;
+ void (*func)(void *info);
+ void *info;

+ int ret;


+};
+
+void task_function_trampoline(void *data)
+{
+ struct task_function_call *tfc = data;
+ struct task_struct *p = tfc->p;
+ struct rq *rq = this_rq();
+

+ if (rq->curr != current) {
+ tfc->ret = 1;
+ return;
+ }


+
+ if (rq->curr != p)
+ return;
+
+ tfc->func(tfc->info);
+}
+
/**
* task_oncpu_function_call - call a function on the cpu on which a task runs
* @p: the task to evaluate

@@ -2277,12 +2301,30 @@ EXPORT_SYMBOL_GPL(kick_process);


void task_oncpu_function_call(struct task_struct *p,
void (*func) (void *info), void *info)

{


+ struct task_function_call data = {
+ .p = p,
+ .func = func,
+ .info = info,
+ };

+ unsigned long nr_switches;
+ struct rq *rq;
int cpu;

preempt_disable();
- cpu = task_cpu(p);
- if (task_curr(p))


- smp_call_function_single(cpu, func, info, 1);

+again:
+ data.ret = 0;
+ rq = task_rq(p);
+ nr_switches = rq->nr_switches;
+ smp_rmb();
+ if (task_curr(p)) {
+ smp_call_function_single(cpu_of(rq),
+ task_function_trampoline, &data, 1);
+ if (data.ret == 1) {
+ while (rq->nr_switches == nr_switches)
+ cpu_relax();
+ goto again;
+ }
+ }
preempt_enable();
}

@@ -2776,9 +2818,12 @@ static inline void


prepare_task_switch(struct rq *rq, struct task_struct *prev,
struct task_struct *next)
{

+ sched_info_switch(prev, next);
+ perf_event_task_sched_out(prev, next);
fire_sched_out_preempt_notifiers(prev, next);
prepare_lock_switch(rq, next);
prepare_arch_switch(next);
+ trace_sched_switch(prev, next);
}

/**

@@ -2911,7 +2956,6 @@ context_switch(struct rq *rq, struct task_struct *prev,


struct mm_struct *mm, *oldmm;

prepare_task_switch(rq, prev, next);
- trace_sched_switch(prev, next);
mm = next->mm;
oldmm = prev->active_mm;
/*

@@ -3989,10 +4033,6 @@ need_resched_nonpreemptible:


rq->skip_clock_update = 0;

if (likely(prev != next)) {
- sched_info_switch(prev, next);
- perf_event_task_sched_out(prev, next);
-
- rq->nr_switches++;
rq->curr = next;
++*switch_count;

@@ -4005,6 +4045,7 @@ need_resched_nonpreemptible:
*/
cpu = smp_processor_id();
rq = cpu_rq(cpu);
+ rq->nr_switches++;
} else
raw_spin_unlock_irq(&rq->lock);

Peter Zijlstra

Jan 27, 2011, 10:00:03 AM

On Thu, 2011-01-27 at 15:28 +0100, Peter Zijlstra wrote:
>
> <crazy idea here>
>

> So I've likely fallen off a cliff somewhere along the line, but just in


> case, here's the patch:

It's completely broken, ignore this :-)

Oleg Nesterov

Jan 27, 2011, 11:10:02 AM

On 01/27, Peter Zijlstra wrote:
>
> On Wed, 2011-01-26 at 22:33 +0100, Oleg Nesterov wrote:
> >
> > This is what I had in mind initially but I didn't dare to add the
> > new member into rq, it is only needed for perf.
>
> Right, but it's a weakness in the task_oncpu_function_call()
> implementation, wouldn't any user run into this problem eventually?

I think that other users are fine, they do not try to change ctx.

OTOH, probably your change in task_oncpu_function_call() makes sense
anyway, this way func() can never have some other subtle problems
with context_switch in progress.

Oleg.

Oleg Nesterov

Jan 27, 2011, 11:20:02 AM

On 01/27, Peter Zijlstra wrote:
>
> +void task_function_trampoline(void *data)
> +{
> + struct task_function_call *tfc = data;
> + struct task_struct *p = tfc->p;
> + struct rq *rq = this_rq();
> +
> +#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
> + if (rq->in_ctxsw)
> + return;
> +#endif
> +
> + if (rq->curr != p)
> + return;

Yes, I think this should solve the problem.

> prepare_task_switch(struct rq *rq, struct task_struct *prev,
> struct task_struct *next)
> {
> +#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
> + rq->in_ctxsw = 1;
> +#endif
> + sched_info_switch(prev, next);
> + perf_event_task_sched_out(prev, next);
> fire_sched_out_preempt_notifiers(prev, next);
> prepare_lock_switch(rq, next);
> prepare_arch_switch(next);
> + trace_sched_switch(prev, next);
> }

Yes, I was wondering why schedule() calls perf_event_task_sched_out().
This way the code looks more symmetrical/understandable.

> /**
> @@ -2823,6 +2860,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
> perf_event_task_sched_in(current);
> #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
> local_irq_enable();
> + rq->in_ctxsw = 0;

If we think that context_switch finishes here, probably it would be
more clean to clear ->in_ctxsw before local_irq_enable().

> #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
> finish_lock_switch(rq, prev);

But, otoh, maybe finish_lock_switch() can clear in_ctxsw, it already
checks __ARCH_WANT_INTERRUPTS_ON_CTXSW. Likewise, perhaps it can be
set in prepare_lock_switch() which enables irqs.

But this is cosmetic and up to you.

Oleg.

Peter Zijlstra

Jan 27, 2011, 11:30:02 AM

On Thu, 2011-01-27 at 17:10 +0100, Oleg Nesterov wrote:
> > #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
> > local_irq_enable();
> > + rq->in_ctxsw = 0;
>
> If we think that context_switch finishes here, probably it would be
> more clean to clear ->in_ctxsw before local_irq_enable().

It must in fact be done before; otherwise there's a race where we set
ctx after perf_event_task_sched_in() runs and we send the IPI, the IPI
lands after local_irq_enable() but before rq->in_ctxsw = 0, the IPI is
ignored, and nothing happens.

> > #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
> > finish_lock_switch(rq, prev);
>
> But, otoh, maybe finish_lock_switch() can clear in_ctxsw, it already
> checks __ARCH_WANT_INTERRUPTS_ON_CTXSW. Likewise, perhaps it can be
> set in prepare_lock_switch() which enables irqs.
>
> But this is cosmetic and up to you.

Can't do because of the above thing..

Oleg Nesterov

unread,
Jan 27, 2011, 12:10:01 PM1/27/11
to
On 01/27, Peter Zijlstra wrote:
>
> Right, so the fact of introducing extra scheduling makes me feel
> uncomfortable... the whole purpose is to observe without perturbing (as
> much as possible).

Yes, agreed.

Well, otoh the patch removes the code which sets ->task_ctx from
__perf_install_in_context() and __perf_event_enable(), and perhaps
we could simplify the things further, but anyway I agree.

> Should I think suffice to get the ctx in sync with the task state, we've
> got the following cases:
> 1) task is in the middle of scheduling in
> 2) task is in the middle of scheduling out
> 3) task is running
>
> Without __ARCH_WANT_INTERRUPTS_ON_CTXSW everything is boring and works,
> 1: the IPI will be delayed until 3, 2: the IPI finds another task and
> the next schedule in will sort things.
>
> With, however, things are more interesting. 2 seems to be adequately
> covered by the patch I just send, the IPI will bail and the next
> sched-in of the relevant task will pick matters up. 1 otoh doesn't seem
> covered, the IPI will bail, leaving us stranded.

Hmm, yes... Strangely, I missed that when I was thinking about in_ctxsw.

Perhaps, we can change task_oncpu_function_call() so that it returns
-EAGAIN in case it hits in_ctxsw != 0? If the caller sees -EAGAIN, it
should always retry even if !ctx->is_active.
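IOW, the caller side would become something like this (only a sketch of
the idea, with __perf_install_in_context() as the example):

retry:
	ret = task_oncpu_function_call(task, __perf_install_in_context, event);
	if (ret == -EAGAIN)
		goto retry;	/* hit a context switch in progress */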

Oleg.

Oleg Nesterov

unread,
Jan 27, 2011, 12:10:02 PM1/27/11
to
On 01/27, Peter Zijlstra wrote:
>
> On Thu, 2011-01-27 at 17:10 +0100, Oleg Nesterov wrote:
> > > #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
> > > local_irq_enable();
> > > + rq->in_ctxsw = 0;
> >
> > If we think that context_switch finishes here, probably it would be
> > more clean to clear ->in_ctxsw before local_irq_enable().
>
> It must in fact be done before,

Yes, I already realized this when I was reading another email from you.

> > But, otoh, maybe finish_lock_switch() can clear in_ctxsw, it already
> > checks __ARCH_WANT_INTERRUPTS_ON_CTXSW. Likewise, perhaps it can be
> > set in prepare_lock_switch() which enables irqs.
> >
> > But this is cosmetic and up to you.
>
> Can't do because of the above thing..

Right.

Oleg.

Peter Zijlstra

unread,
Jan 27, 2011, 12:20:02 PM1/27/11
to
On Thu, 2011-01-27 at 17:57 +0100, Oleg Nesterov wrote:
>
> > With, however, things are more interesting. 2 seems to be adequately
> > covered by the patch I just send, the IPI will bail and the next
> > sched-in of the relevant task will pick matters up. 1 otoh doesn't seem
> > covered, the IPI will bail, leaving us stranded.
>
> Hmm, yes... Strangely, I missed that when I was thinking about in_ctxsw.
>
> Perhaps, we can change task_oncpu_function_call() so that it returns
> -EAGAIN in case it hits in_ctxsw != 0? If the caller sees -EAGAIN, it
> should always retry even if !ctx->is_active.

That would be very easy to do, we can pass a return value through the
task_function_call structure.
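Something like this, I suppose (sketch):

struct task_function_call {
	struct task_struct *p;
	void (*func)(void *info);
	void *info;
	int ret;	/* filled in by the trampoline */
};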

Oleg Nesterov

unread,
Jan 27, 2011, 5:30:01 PM1/27/11
to
On 01/27, Peter Zijlstra wrote:
>
> On Thu, 2011-01-27 at 17:57 +0100, Oleg Nesterov wrote:
> >
> > > With, however, things are more interesting. 2 seems to be adequately
> > > covered by the patch I just send, the IPI will bail and the next
> > > sched-in of the relevant task will pick matters up. 1 otoh doesn't seem
> > > covered, the IPI will bail, leaving us stranded.
> >
> > Hmm, yes... Strangely, I missed that when I was thinking about in_ctxsw.
> >
> > Perhaps, we can change task_oncpu_function_call() so that it returns
> > -EAGAIN in case it hits in_ctxsw != 0? If the caller sees -EAGAIN, it
> > should always retry even if !ctx->is_active.
>
> That would be very easy to do, we can pass a return value through the
> task_function_call structure.

Yes.

Perhaps task_oncpu_function_call() should retry itself to simplify the
callers. I wonder if we should also retry if rq->curr != p...

Oh. You know, I am starting to think I will never understand this.
Forget about the problems with __ARCH_WANT_INTERRUPTS_ON_CTXSW.

perf_install_in_context() does task_oncpu_function_call() and then


	// ctx->is_active == 0

	/*
	 * The lock prevents that this context is scheduled in so we
	 * can add the event safely, if it the call above did not
	 * succeed.
	 */
	if (list_empty(&event->group_entry))
		add_event_to_ctx(event, ctx);

This assumes that the task is not running.

Why? Because (I guess) we assume that either task_oncpu_function_call()
should see task_curr() == T, or if the task becomes running after that
it should see the new ->perf_event_ctxp[ctxn] != NULL. And I do not see
how we can prove this.

If find_get_context() sets the new context, the only guarantee we have
is: perf_event_exit_task() can't miss this context. The task, however,
can be scheduled in and miss the new value in perf_event_ctxp[].
And, task_oncpu_function_call() can equally miss rq->curr == task.

But. I think this all falls into the absolutely theoretical category,
and in the worst case nothing really bad can happen, just event_sched_in()
will be delayed until this task reschedules.


So, I think your patch should fix all the problems with schedule. It just
needs the couple of changes we already discussed:

- finish_task_switch() should clear rq->in_ctxsw before
local_irq_enable()

- task_oncpu_function_call() (or its callers) should always
retry the "if (task_curr(p))" code if ->in_ctxsw is true.

If you think we have other problems here please don't tell me,
I already got lost ;)

Oleg.

Oleg Nesterov

unread,
Jan 21, 2011, 1:10:04 PM1/21/11
to
In theory, almost every user of task->child->perf_event_ctxp[]
is wrong. find_get_context() can install the new context at any
moment, we need read_barrier_depends().

dbe08d82ce3967ccdf459f7951d02589cf967300 "perf: Fix
find_get_context() vs perf_event_exit_task() race" added
rcu_dereference() into perf_event_exit_task_context() to make
the precedent, but this makes __rcu_dereference_check() unhappy.
Use rcu_dereference_raw() to shut up the warning.

Reported-by: Ingo Molnar <mi...@elte.hu>


Signed-off-by: Oleg Nesterov <ol...@redhat.com>
---

kernel/perf_event.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- git/kernel/perf_event.c~6_rcu_check 2011-01-19 18:49:23.000000000 +0100
+++ git/kernel/perf_event.c 2011-01-21 18:41:02.000000000 +0100
@@ -6133,7 +6133,7 @@ static void perf_event_exit_task_context


* scheduled, so we are now safe from rescheduling changing
* our context.
*/
- child_ctx = rcu_dereference(child->perf_event_ctxp[ctxn]);
+ child_ctx = rcu_dereference_raw(child->perf_event_ctxp[ctxn]);
task_ctx_sched_out(child_ctx, EVENT_ALL);

/*

--

Peter Zijlstra

unread,
Jan 26, 2011, 2:40:02 PM1/26/11
to
On Wed, 2011-01-26 at 20:05 +0100, Peter Zijlstra wrote:
> On Wed, 2011-01-26 at 19:49 +0100, Oleg Nesterov wrote:
> > On 01/26, Oleg Nesterov wrote:
> > >
> > > Please see the untested patch below. It doesn't change perf_event_enable(),
> > > only perf_install_in_context().
> >
> > Forgot to mention... Also, it doesn't try to fix the race with do_exit(),
> > this needs another change.
> >
> > And, damn, can't resist. This is mostly cosmetic issue, but I feel
> > discomfort every time I look at task_oncpu_function_call(). It _looks_
> > obviously wrong, even if the problem doesn't exist in practice. I'll
> > send the pedantic fix to keep the maintainers busy ;)
>
> I've been trying to sit down and work my way through it today, your last
> suggestion very nearly seemed to make sense, but I kept getting
> distracted.
>
> FWIW I think perf_event_enable() has the very same issue...

Wouldn't something like the below cure things too?


---
kernel/sched.c | 23 ++++++++++++++++++++++-
1 files changed, 22 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 18d38e4..7eadbcf 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2265,6 +2265,22 @@ void kick_process(struct task_struct *p)


EXPORT_SYMBOL_GPL(kick_process);
#endif /* CONFIG_SMP */

+struct task_function_call {
+	struct task_struct *p;
+	void (*func)(void *info);
+	void *info;
+};
+
+void task_function_trampoline(void *data)
+{
+	struct task_function_call *tfc = data;
+
+	if (this_rq()->curr != tfc->p)
+		return;
+
+	tfc->func(tfc->info);
+}
+
/**
* task_oncpu_function_call - call a function on the cpu on which a task runs
* @p: the task to evaluate

@@ -2278,11 +2294,16 @@ void task_oncpu_function_call(struct task_struct *p,


void (*func) (void *info), void *info)

{
int cpu;


+ struct task_function_call data = {
+ .p = p,
+ .func = func,
+ .info = info,
+ };

preempt_disable();
cpu = task_cpu(p);


if (task_curr(p))
- smp_call_function_single(cpu, func, info, 1);

+ smp_call_function_single(cpu, task_function_trampoline, &data, 1);
preempt_enable();
}

Peter Zijlstra

unread,
Jan 28, 2011, 7:50:01 AM1/28/11
to
On Thu, 2011-01-27 at 23:18 +0100, Oleg Nesterov wrote:
> On 01/27, Peter Zijlstra wrote:
> >
> > On Thu, 2011-01-27 at 17:57 +0100, Oleg Nesterov wrote:
> > >
> > > > With, however, things are more interesting. 2 seems to be adequately
> > > > covered by the patch I just send, the IPI will bail and the next
> > > > sched-in of the relevant task will pick matters up. 1 otoh doesn't seem
> > > > covered, the IPI will bail, leaving us stranded.
> > >
> > > Hmm, yes... Strangely, I missed that when I was thinking about in_ctxsw.
> > >
> > > Perhaps, we can change task_oncpu_function_call() so that it returns
> > > -EAGAIN in case it hits in_ctxsw != 0? If the caller sees -EAGAIN, it
> > > should always retry even if !ctx->is_active.
> >
> > That would be very easy to do, we can pass a return value through the
> > task_function_call structure.
>
> Yes.
>
> Perhaps task_oncpu_function_call() should retry itself to simplify the
> callers. I wonder if we should also retry if rq->curr != p...

Yes we should, the task could have been migrated and be running on
another cpu..

> Oh. You know, I am starting to think I will never understand this.

Oh, please don't give up, we shall persevere with this until it all
makes perfect sense (or we're both mental and get locked up), it can
only improve matters, right? :-)

> perf_install_in_context() does task_oncpu_function_call() and then
>
>
> // ctx->is_active == 0
>
> /*
> * The lock prevents that this context is scheduled in so we
> * can add the event safely, if it the call above did not
> * succeed.
> */
> if (list_empty(&event->group_entry))
> add_event_to_ctx(event, ctx);
>
> This assumes that the task is not running.
>
> Why? Because (I guess) we assume that either task_oncpu_function_call()
> should see task_curr() == T, or if the task becomes running after that
> it should see the new ->perf_event_ctxp[ctxn] != NULL. And I do not see
> how we can prove this.

Right, that is the intended logic, lets see if I can make that be true.

So task_oncpu_function_call() as per after the below patch, will loop
until either:

- the task isn't running, or
- we executed the function on the cpu during the task's stay there

If it isn't running, it might have scheduled in by the time we've
acquired the ctx->lock; the ->is_active test catches that and retries
the task_oncpu_function_call(). If it's still not running, us holding the
ctx->lock ensures its possible schedule-in on another cpu will be held
up at perf_event_task_sched_in().

Now:

> If find_get_context() sets the new context, the only guarantee we have
> is: perf_event_exit_task() can't miss this context. The task, however,
> can be scheduled in and miss the new value in perf_event_ctxp[].
> And, task_oncpu_function_call() can equally miss rq->curr == task.

Right, so in case the perf_event_task_sched_in() missed the assignment
of ->perf_event_ctxp[n], then our above story falls flat on its face.

Because then we cannot rely on ->is_active being set for running tasks.

So we need a task_curr() test under that lock, which would need
perf_event_task_sched_out() to be done _before_ we set rq->curr = next,
I _think_.

> But. I think this all falls into the absolutely theoretical category,
> and in the worst case nothing really bad can happen, just event_sched_in()
> will be delayed until this task reshedules.

Still, it would be bad if some HPC workload (1 task per cpu, very sparse
syscalls, hardly any scheduling at all) went wonky once in a blue
moon.

More importantly, I think it would be best if this code were obvious; it
clearly isn't, so let's hang in here for a little while more.

> So, I think your patch should fix all the problems with schedule. It just
> needs the couple of changes we already discussed:
>
> - finish_task_switch() should clear rq->in_ctxsw before
> local_irq_enable()

check, although I should still add at least a little comment in
task_oncpu_function_call() explaining things.

> - task_oncpu_function_call() (or its callers) should always
> retry the "if (task_curr(p))" code if ->in_ctxsw is true.

check.

> If you think we have other problems here please don't tell me,
> I already got lost ;)

Sorry to bother you more, but I think we're actually getting
somewhere...

---
include/linux/sched.h | 4 +-
kernel/sched.c | 64 ++++++++++++++++++++++++++++++++++++++++++------
2 files changed, 58 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d747f94..b147d73 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2581,8 +2581,8 @@ static inline void inc_syscw(struct task_struct *tsk)
/*
* Call the function if the target task is executing on a CPU right now:
*/
-extern void task_oncpu_function_call(struct task_struct *p,
- void (*func) (void *info), void *info);
+extern int task_oncpu_function_call(struct task_struct *p,
+ void (*func) (void *info), void *info);


#ifdef CONFIG_MM_OWNER
diff --git a/kernel/sched.c b/kernel/sched.c
index 18d38e4..9ef760c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c


@@ -490,7 +490,10 @@ struct rq {
struct task_struct *curr, *idle, *stop;
unsigned long next_balance;
struct mm_struct *prev_mm;
-
+#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
+ int in_ctxsw;
+#endif
+
u64 clock;
u64 clock_task;

@@ -2265,6 +2268,34 @@ void kick_process(struct task_struct *p)


EXPORT_SYMBOL_GPL(kick_process);
#endif /* CONFIG_SMP */

+struct task_function_call {
+	struct task_struct *p;
+	void (*func)(void *info);
+	void *info;
+	int ret;
+};
+
+void task_function_trampoline(void *data)
+{
+	struct task_function_call *tfc = data;
+	struct task_struct *p = tfc->p;
+	struct rq *rq = this_rq();
+
+	tfc->ret = -EAGAIN;
+
+#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
+	if (rq->in_ctxsw)
+		return;
+#endif
+
+	if (rq->curr != p)
+		return;
+
+	tfc->ret = 0;
+
+	tfc->func(tfc->info);
+}
+
/**
* task_oncpu_function_call - call a function on the cpu on which a task runs
* @p: the task to evaluate

@@ -2273,17 +2304,31 @@ EXPORT_SYMBOL_GPL(kick_process);
*
* Calls the function @func when the task is currently running. This might
* be on the current CPU, which just calls the function directly
+ *
+ * returns: 0 when @func got called
*/
-void task_oncpu_function_call(struct task_struct *p,
+int task_oncpu_function_call(struct task_struct *p,


void (*func) (void *info), void *info)

{


+ struct task_function_call data = {
+ .p = p,
+ .func = func,
+ .info = info,
+ };

int cpu;

preempt_disable();
+again:
+ data.ret = -ESRCH; /* No such (running) process */
cpu = task_cpu(p);
- if (task_curr(p))


- smp_call_function_single(cpu, func, info, 1);

+ if (task_curr(p)) {


+ smp_call_function_single(cpu, task_function_trampoline, &data, 1);

+ if (data.ret == -EAGAIN)
+ goto again;
+ }
preempt_enable();
+
+ return data.ret;
}

#ifdef CONFIG_SMP
@@ -2776,9 +2821,15 @@ static inline void


prepare_task_switch(struct rq *rq, struct task_struct *prev,
struct task_struct *next)
{
+#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
+ rq->in_ctxsw = 1;
+#endif
+ sched_info_switch(prev, next);
+ perf_event_task_sched_out(prev, next);
fire_sched_out_preempt_notifiers(prev, next);
prepare_lock_switch(rq, next);
prepare_arch_switch(next);
+ trace_sched_switch(prev, next);
}

/**
@@ -2822,6 +2873,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
#endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
perf_event_task_sched_in(current);
#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW


+ rq->in_ctxsw = 0;

local_irq_enable();


#endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
finish_lock_switch(rq, prev);

@@ -2911,7 +2963,6 @@ context_switch(struct rq *rq, struct task_struct *prev,


struct mm_struct *mm, *oldmm;

prepare_task_switch(rq, prev, next);
- trace_sched_switch(prev, next);
mm = next->mm;
oldmm = prev->active_mm;
/*

@@ -3989,9 +4040,6 @@ need_resched_nonpreemptible:


rq->skip_clock_update = 0;

if (likely(prev != next)) {
- sched_info_switch(prev, next);
- perf_event_task_sched_out(prev, next);
-

rq->nr_switches++;
rq->curr = next;
++*switch_count;

--

Peter Zijlstra

unread,
Jan 28, 2011, 10:00:01 AM1/28/11
to
On Fri, 2011-01-28 at 12:52 +0100, Peter Zijlstra wrote:
> Right, so in case the perf_event_task_sched_in() missed the assignment
> of ->perf_event_ctxp[n], then our above story falls flat on its face.
>
> Because then we cannot rely on ->is_active being set for running tasks.
>
> So we need a task_curr() test under that lock, which would need
> perf_event_task_sched_out() to be done _before_ we set rq->curr = next,
> I _think_.


Ok, so how about something like this:

if task_oncpu_function_call() managed to execute the function proper,
we're done. Otherwise, if while holding the lock, task_curr() is true,
it means the task is now current and we should try again, if its not, it
cannot become current because us holding ctx->lock blocks
perf_event_task_sched_in().

Hmm?

---
include/linux/sched.h | 4 +-
kernel/perf_event.c | 23 ++++++++++-------
kernel/sched.c | 65 +++++++++++++++++++++++++++++++++++++++++++------
3 files changed, 73 insertions(+), 19 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d747f94..b147d73 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2581,8 +2581,8 @@ static inline void inc_syscw(struct task_struct *tsk)
/*
* Call the function if the target task is executing on a CPU right now:
*/
-extern void task_oncpu_function_call(struct task_struct *p,
- void (*func) (void *info), void *info);
+extern int task_oncpu_function_call(struct task_struct *p,
+ void (*func) (void *info), void *info);


#ifdef CONFIG_MM_OWNER

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 852ae8c..0d988b8 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -1017,6 +1017,7 @@ perf_install_in_context(struct perf_event_context *ctx,


int cpu)
{
struct task_struct *task = ctx->task;

+ int ret;

event->ctx = ctx;

@@ -1031,25 +1032,29 @@ perf_install_in_context(struct perf_event_context *ctx,


}

retry:
- task_oncpu_function_call(task, __perf_install_in_context,
- event);

+ ret = task_oncpu_function_call(task,
+ __perf_install_in_context, event);
+
+ if (!ret)
+ return;

raw_spin_lock_irq(&ctx->lock);
+


/*
- * we need to retry the smp call.

+ * If the task_oncpu_function_call() failed, re-check task_curr() now
+ * that we hold ctx->lock(), if it is running retry the IPI.


*/
- if (ctx->is_active && list_empty(&event->group_entry)) {

+ if (task_curr(task)) {
raw_spin_unlock_irq(&ctx->lock);
goto retry;
}

/*
- * The lock prevents that this context is scheduled in so we
- * can add the event safely, if it the call above did not
- * succeed.
+ * Otherwise the lock prevents that this context is scheduled in so we
+ * can add the event safely, if it the call above did not succeed.


*/
- if (list_empty(&event->group_entry))
- add_event_to_ctx(event, ctx);

+ add_event_to_ctx(event, ctx);
+
raw_spin_unlock_irq(&ctx->lock);
}

diff --git a/kernel/sched.c b/kernel/sched.c
index 18d38e4..3686dce 100644

@@ -2911,7 +2963,7 @@ context_switch(struct rq *rq, struct task_struct *prev,


struct mm_struct *mm, *oldmm;

prepare_task_switch(rq, prev, next);
- trace_sched_switch(prev, next);

+


mm = next->mm;
oldmm = prev->active_mm;
/*

@@ -3989,9 +4041,6 @@ need_resched_nonpreemptible:

Oleg Nesterov

unread,
Jan 28, 2011, 11:40:01 AM1/28/11
to
On 01/28, Peter Zijlstra wrote:
>
> On Fri, 2011-01-28 at 12:52 +0100, Peter Zijlstra wrote:
> > Right, so in case the perf_event_task_sched_in() missed the assignment
> > of ->perf_event_ctxp[n], then our above story falls flat on its face.
> >
> > Because then we cannot rely on ->is_active being set for running tasks.
> >
> > So we need a task_curr() test under that lock, which would need
> > perf_event_task_sched_out() to be done _before_ we set rq->curr = next,
> > I _think_.
>
> Ok, so how about something like this:
>
> if task_oncpu_function_call() managed to execute the function proper,
> we're done. Otherwise, if while holding the lock task_curr() is true,
> it means the task is now current and we should try again; if it's not, it
> cannot become current because us holding ctx->lock blocks
> perf_event_task_sched_in().
>
> Hmm?

I _feel_ this patch should be right. To me, this even makes the code
more understandable. But I'll try to re-read it once again, somehow
I can't concentrate today.

> @@ -1031,25 +1032,29 @@ perf_install_in_context(struct perf_event_context *ctx,
> }
>
> retry:
> - task_oncpu_function_call(task, __perf_install_in_context,
> - event);
> + ret = task_oncpu_function_call(task,
> + __perf_install_in_context, event);
> +
> + if (!ret)
> + return;
>
> raw_spin_lock_irq(&ctx->lock);
> +
> /*
> - * we need to retry the smp call.
> + * If the task_oncpu_function_call() failed, re-check task_curr() now
> + * that we hold ctx->lock(), if it is running retry the IPI.
> */
> - if (ctx->is_active && list_empty(&event->group_entry)) {
> + if (task_curr(task)) {

Yes, but task_curr() should be exported.

One note. If this patch is correct (I think it is), then this check
in __perf_install_in_context() and __perf_event_enable()

if (cpuctx->task_ctx || ctx->task != current)

return;

should become unneeded. It should be removed or turned into WARN_ON()
imho, otherwise it looks confusing.
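IOW, something like (just to illustrate the WARN_ON() variant):

	if (WARN_ON_ONCE(cpuctx->task_ctx || ctx->task != current))
		return;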

Oleg.

Peter Zijlstra

unread,
Jan 28, 2011, 1:20:02 PM1/28/11
to
On Fri, 2011-01-28 at 17:28 +0100, Oleg Nesterov wrote:
>
> I _feel_ this patch should be right. To me, this even makes the code
> more understandable. But I'll try to re-read it once again, somehow
> I can't concentrate today.

Just to give you more food for thought, I couldn't help myself..

(compile tested only so far)

---
include/linux/sched.h | 7 --
kernel/perf_event.c | 235 +++++++++++++++++++++++++++++++------------------
kernel/sched.c | 31 +------
3 files changed, 156 insertions(+), 117 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d747f94..0b40ee3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2578,13 +2578,6 @@ static inline void inc_syscw(struct task_struct *tsk)
#define TASK_SIZE_OF(tsk) TASK_SIZE
#endif

-/*
- * Call the function if the target task is executing on a CPU right now:
- */


-extern void task_oncpu_function_call(struct task_struct *p,
- void (*func) (void *info), void *info);

-
-


#ifdef CONFIG_MM_OWNER
extern void mm_update_next_owner(struct mm_struct *mm);

extern void mm_init_owner(struct mm_struct *mm, struct task_struct *p);
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 852ae8c..cb62433 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -38,6 +38,83 @@

#include <asm/irq_regs.h>

+struct remote_function_call {
+	struct task_struct *p;
+	int (*func)(void *info);
+	void *info;
+	int ret;
+};
+
+static void remote_function(void *data)
+{
+	struct remote_function_call *tfc = data;
+	struct task_struct *p = tfc->p;
+
+	if (p) {
+		tfc->ret = -EAGAIN;
+		if (task_cpu(p) != smp_processor_id() || !task_curr(p))
+			return;
+	}
+
+	tfc->ret = tfc->func(tfc->info);
+}
+
+/**
+ * task_function_call - call a function on the cpu on which a task runs
+ * @p: the task to evaluate
+ * @func: the function to be called
+ * @info: the function call argument
+ *
+ * Calls the function @func when the task is currently running. This might
+ * be on the current CPU, which just calls the function directly
+ *
+ * returns: @func return value, or
+ * -ESRCH - when the process isn't running
+ * -EAGAIN - when the process moved away
+ */
+static int
+task_function_call(struct task_struct *p, int (*func) (void *info), void *info)
+{
+	struct remote_function_call data = {
+		.p = p,
+		.func = func,
+		.info = info,
+		.ret = -ESRCH, /* No such (running) process */
+	};
+	int cpu;
+
+	preempt_disable();
+	cpu = task_cpu(p);
+	if (task_curr(p))
+		smp_call_function_single(cpu, remote_function, &data, 1);
+	preempt_enable();
+
+	return data.ret;
+}
+
+/**
+ * cpu_function_call - call a function on the cpu
+ * @func: the function to be called
+ * @info: the function call argument
+ *
+ * Calls the function @func on the remote cpu.
+ *
+ * returns: @func return value or -ENXIO when the cpu is offline
+ */
+static int cpu_function_call(int cpu, int (*func) (void *info), void *info)
+{
+	struct remote_function_call data = {
+		.p = NULL,
+		.func = func,
+		.info = info,
+		.ret = -ENXIO, /* No such CPU */
+	};
+
+ smp_call_function_single(cpu, remote_function, &data, 1);
+
+ return data.ret;
+}
+
enum event_type_t {
EVENT_FLEXIBLE = 0x1,
EVENT_PINNED = 0x2,
@@ -618,27 +695,18 @@ __get_cpu_context(struct perf_event_context *ctx)
* We disable the event on the hardware level first. After that we
* remove it from the context list.
*/
-static void __perf_event_remove_from_context(void *info)
+static int __perf_remove_from_context(void *info)
{
struct perf_event *event = info;
struct perf_event_context *ctx = event->ctx;
struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);

- /*
- * If this is a task context, we need to check whether it is


- * the current task context of this cpu. If not it has been
- * scheduled out before the smp call arrived.

- */


- if (ctx->task && cpuctx->task_ctx != ctx)

- return;
-
raw_spin_lock(&ctx->lock);
-
event_sched_out(event, cpuctx, ctx);
-
list_del_event(event, ctx);
-
raw_spin_unlock(&ctx->lock);
+
+ return 0;
}


@@ -657,7 +725,7 @@ static void __perf_event_remove_from_context(void *info)
* When called from perf_event_exit_task, it's OK because the
* context has been detached from its task.
*/
-static void perf_event_remove_from_context(struct perf_event *event)
+static void perf_remove_from_context(struct perf_event *event)
{
struct perf_event_context *ctx = event->ctx;


struct task_struct *task = ctx->task;

@@ -667,39 +735,36 @@ static void perf_event_remove_from_context(struct perf_event *event)
* Per cpu events are removed via an smp call and
* the removal is always successful.
*/
- smp_call_function_single(event->cpu,
- __perf_event_remove_from_context,
- event, 1);
+ cpu_function_call(event->cpu, __perf_remove_from_context, event);
return;
}

retry:
- task_oncpu_function_call(task, __perf_event_remove_from_context,
- event);
+ if (!task_function_call(task, __perf_remove_from_context, event))
+ return;

raw_spin_lock_irq(&ctx->lock);
/*
- * If the context is active we need to retry the smp call.
+ * If we failed to find a running task, but find it running now that
+ * we've acquired the ctx->lock, retry.
*/
- if (ctx->nr_active && !list_empty(&event->group_entry)) {


+ if (task_curr(task)) {
raw_spin_unlock_irq(&ctx->lock);
goto retry;
}

/*
- * The lock prevents that this context is scheduled in so we

- * can remove the event safely, if the call above did not
- * succeed.
+ * Since the task isn't running, its safe to remove the event, us
+ * holding the ctx->lock ensures the task won't get scheduled in.
*/
- if (!list_empty(&event->group_entry))
- list_del_event(event, ctx);
+ list_del_event(event, ctx);
raw_spin_unlock_irq(&ctx->lock);
}

/*
* Cross CPU call to disable a performance event
*/
-static void __perf_event_disable(void *info)
+static int __perf_event_disable(void *info)
{
struct perf_event *event = info;
struct perf_event_context *ctx = event->ctx;
@@ -710,7 +775,7 @@ static void __perf_event_disable(void *info)
* event's task is the current task on this cpu.
*/


if (ctx->task && cpuctx->task_ctx != ctx)

- return;
+ return -EINVAL;

raw_spin_lock(&ctx->lock);

@@ -729,6 +794,8 @@ static void __perf_event_disable(void *info)
}

raw_spin_unlock(&ctx->lock);
+
+ return 0;
}

/*
@@ -753,13 +820,13 @@ void perf_event_disable(struct perf_event *event)
/*
* Disable the event on the cpu that it's on
*/
- smp_call_function_single(event->cpu, __perf_event_disable,
- event, 1);
+ cpu_function_call(event->cpu, __perf_event_disable, event);
return;
}

retry:
- task_oncpu_function_call(task, __perf_event_disable, event);
+ if (!task_function_call(task, __perf_event_disable, event))
+ return;

raw_spin_lock_irq(&ctx->lock);
/*
@@ -767,6 +834,11 @@ retry:
*/
if (event->state == PERF_EVENT_STATE_ACTIVE) {
raw_spin_unlock_irq(&ctx->lock);
+ /*
+ * Reload the task pointer, it might have been changed by
+ * a concurrent perf_event_context_sched_out().
+ */
+ task = ctx->task;
goto retry;
}

@@ -778,7 +850,6 @@ retry:
update_group_times(event);
event->state = PERF_EVENT_STATE_OFF;
}
-
raw_spin_unlock_irq(&ctx->lock);
}

@@ -928,12 +999,14 @@ static void add_event_to_ctx(struct perf_event *event,
event->tstamp_stopped = tstamp;
}

+static void perf_event_context_sched_in(struct perf_event_context *ctx);
+
/*
* Cross CPU call to install and enable a performance event
*
* Must be called with ctx->mutex held
*/
-static void __perf_install_in_context(void *info)
+static int __perf_install_in_context(void *info)
{
struct perf_event *event = info;
struct perf_event_context *ctx = event->ctx;
@@ -942,20 +1015,15 @@ static void __perf_install_in_context(void *info)
int err;

/*
- * If this is a task context, we need to check whether it is


- * the current task context of this cpu. If not it has been
- * scheduled out before the smp call arrived.
- * Or possibly this is the right context but it isn't
- * on this cpu because it had no events.

+ * In case we're installing a new context to an already running task,
+ * could also happen before perf_event_task_sched_in() on architectures
+ * which do context switches with IRQs enabled.


*/
- if (ctx->task && cpuctx->task_ctx != ctx) {
- if (cpuctx->task_ctx || ctx->task != current)
- return;
- cpuctx->task_ctx = ctx;
- }

+ if (ctx->task && !cpuctx->task_ctx)
+ perf_event_context_sched_in(ctx);

raw_spin_lock(&ctx->lock);
- ctx->is_active = 1;
+ WARN_ON_ONCE(!ctx->is_active);
update_context_time(ctx);

add_event_to_ctx(event, ctx);
@@ -997,6 +1065,8 @@ static void __perf_install_in_context(void *info)

unlock:
raw_spin_unlock(&ctx->lock);
+
+ return 0;
}

/*
@@ -1025,31 +1095,29 @@ perf_install_in_context(struct perf_event_context *ctx,


* Per cpu events are installed via an smp call and
* the install is always successful.
*/

- smp_call_function_single(cpu, __perf_install_in_context,
- event, 1);
+ cpu_function_call(cpu, __perf_install_in_context, event);
return;


}

retry:
- task_oncpu_function_call(task, __perf_install_in_context,
- event);

+ if (!task_function_call(task, __perf_install_in_context, event))
+ return;

raw_spin_lock_irq(&ctx->lock);


/*
- * we need to retry the smp call.

+ * If we failed to find a running task, but find it running now that
+ * we've acquired the ctx->lock, retry.


*/
- if (ctx->is_active && list_empty(&event->group_entry)) {
+ if (task_curr(task)) {

raw_spin_unlock_irq(&ctx->lock);
goto retry;
}

/*
- * The lock prevents that this context is scheduled in so we
- * can add the event safely, if it the call above did not
- * succeed.

+ * Since the task isn't running, its safe to add the event, us holding
+ * the ctx->lock ensures the task won't get scheduled in.


*/
- if (list_empty(&event->group_entry))
- add_event_to_ctx(event, ctx);
+ add_event_to_ctx(event, ctx);

raw_spin_unlock_irq(&ctx->lock);
}

@@ -1078,7 +1146,7 @@ static void __perf_event_mark_enabled(struct perf_event *event,
/*
* Cross CPU call to enable a performance event
*/
-static void __perf_event_enable(void *info)
+static int __perf_event_enable(void *info)
{
struct perf_event *event = info;
struct perf_event_context *ctx = event->ctx;
@@ -1086,18 +1154,10 @@ static void __perf_event_enable(void *info)
struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
int err;

- /*
- * If this is a per-task event, need to check whether this
- * event's task is the current task on this cpu.
- */


- if (ctx->task && cpuctx->task_ctx != ctx) {
- if (cpuctx->task_ctx || ctx->task != current)
- return;
- cpuctx->task_ctx = ctx;
- }

+ if (WARN_ON_ONCE(!ctx->is_active))
+ return -EINVAL;

raw_spin_lock(&ctx->lock);
- ctx->is_active = 1;
update_context_time(ctx);

if (event->state >= PERF_EVENT_STATE_INACTIVE)
@@ -1138,6 +1198,8 @@ static void __perf_event_enable(void *info)

unlock:
raw_spin_unlock(&ctx->lock);
+
+ return 0;
}

/*
@@ -1158,8 +1220,7 @@ void perf_event_enable(struct perf_event *event)
/*
* Enable the event on the cpu that it's on
*/
- smp_call_function_single(event->cpu, __perf_event_enable,
- event, 1);
+ cpu_function_call(event->cpu, __perf_event_enable, event);
return;
}

@@ -1178,8 +1239,15 @@ void perf_event_enable(struct perf_event *event)
event->state = PERF_EVENT_STATE_OFF;

retry:
+ if (!ctx->is_active) {
+ __perf_event_mark_enabled(event, ctx);
+ goto out;
+ }
+
raw_spin_unlock_irq(&ctx->lock);
- task_oncpu_function_call(task, __perf_event_enable, event);
+
+ if (!task_function_call(task, __perf_event_enable, event))
+ return;

raw_spin_lock_irq(&ctx->lock);

@@ -1187,15 +1255,14 @@ retry:
* If the context is active and the event is still off,
* we need to retry the cross-call.
*/
- if (ctx->is_active && event->state == PERF_EVENT_STATE_OFF)
+ if (ctx->is_active && event->state == PERF_EVENT_STATE_OFF) {
+ /*
+ * task could have been flipped by a concurrent
+ * perf_event_context_sched_out()
+ */
+ task = ctx->task;


goto retry;
-
- /*

- * Since we have the lock this context can't be scheduled
- * in, so we can change the state safely.
- */
- if (event->state == PERF_EVENT_STATE_OFF)
- __perf_event_mark_enabled(event, ctx);
+ }

out:
raw_spin_unlock_irq(&ctx->lock);
@@ -1339,8 +1406,8 @@ static void perf_event_sync_stat(struct perf_event_context *ctx,
}
}

-void perf_event_context_sched_out(struct task_struct *task, int ctxn,
- struct task_struct *next)
+static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
+ struct task_struct *next)
{
struct perf_event_context *ctx = task->perf_event_ctxp[ctxn];
struct perf_event_context *next_ctx;
@@ -1541,7 +1608,7 @@ static void task_ctx_sched_in(struct perf_event_context *ctx,
cpuctx->task_ctx = ctx;
}

-void perf_event_context_sched_in(struct perf_event_context *ctx)
+static void perf_event_context_sched_in(struct perf_event_context *ctx)
{
struct perf_cpu_context *cpuctx;

@@ -5949,10 +6016,10 @@ SYSCALL_DEFINE5(perf_event_open,
struct perf_event_context *gctx = group_leader->ctx;

mutex_lock(&gctx->mutex);
- perf_event_remove_from_context(group_leader);
+ perf_remove_from_context(group_leader);
list_for_each_entry(sibling, &group_leader->sibling_list,
group_entry) {
- perf_event_remove_from_context(sibling);
+ perf_remove_from_context(sibling);
put_ctx(gctx);
}
mutex_unlock(&gctx->mutex);
@@ -6103,7 +6170,7 @@ __perf_event_exit_task(struct perf_event *child_event,
{
struct perf_event *parent_event;

- perf_event_remove_from_context(child_event);
+ perf_remove_from_context(child_event);

parent_event = child_event->parent;
/*
@@ -6594,9 +6661,9 @@ static void __perf_event_exit_context(void *__info)
perf_pmu_rotate_stop(ctx->pmu);

list_for_each_entry_safe(event, tmp, &ctx->pinned_groups, group_entry)
- __perf_event_remove_from_context(event);
+ __perf_remove_from_context(event);
list_for_each_entry_safe(event, tmp, &ctx->flexible_groups, group_entry)
- __perf_event_remove_from_context(event);
+ __perf_remove_from_context(event);
}

static void perf_event_exit_cpu_context(int cpu)
diff --git a/kernel/sched.c b/kernel/sched.c
index 18d38e4..01549b0 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -490,7 +490,7 @@ struct rq {


struct task_struct *curr, *idle, *stop;
unsigned long next_balance;
struct mm_struct *prev_mm;
-

+
u64 clock;
u64 clock_task;

@@ -2265,27 +2265,6 @@ void kick_process(struct task_struct *p)


EXPORT_SYMBOL_GPL(kick_process);
#endif /* CONFIG_SMP */

-/**
- * task_oncpu_function_call - call a function on the cpu on which a task runs
- * @p: the task to evaluate
- * @func: the function to be called
- * @info: the function call argument
- *
- * Calls the function @func when the task is currently running. This might
- * be on the current CPU, which just calls the function directly
- */
-void task_oncpu_function_call(struct task_struct *p,


- void (*func) (void *info), void *info)

-{
- int cpu;
-
- preempt_disable();
- cpu = task_cpu(p);


- if (task_curr(p))
- smp_call_function_single(cpu, func, info, 1);

- preempt_enable();
-}
-
#ifdef CONFIG_SMP
/*
* ->cpus_allowed is protected by either TASK_WAKING or rq->lock held.
@@ -2776,9 +2755,12 @@ static inline void


prepare_task_switch(struct rq *rq, struct task_struct *prev,
struct task_struct *next)
{

+ sched_info_switch(prev, next);
+ perf_event_task_sched_out(prev, next);
fire_sched_out_preempt_notifiers(prev, next);
prepare_lock_switch(rq, next);
prepare_arch_switch(next);
+ trace_sched_switch(prev, next);
}

/**

@@ -2911,7 +2893,7 @@ context_switch(struct rq *rq, struct task_struct *prev,


struct mm_struct *mm, *oldmm;

prepare_task_switch(rq, prev, next);
- trace_sched_switch(prev, next);
+
mm = next->mm;
oldmm = prev->active_mm;
/*

@@ -3989,9 +3971,6 @@ need_resched_nonpreemptible:


rq->skip_clock_update = 0;

if (likely(prev != next)) {
- sched_info_switch(prev, next);
- perf_event_task_sched_out(prev, next);
-
rq->nr_switches++;
rq->curr = next;
++*switch_count;

Oleg Nesterov

unread,
Jan 31, 2011, 1:20:02 PM1/31/11
to
On 01/28, Peter Zijlstra wrote:
>
> Just to give you more food for thought, I couldn't help myself..

Hmm. So far I am only trying to understand the perf_install_in_context()
paths. And, after I spent almost 2 hours, I am starting to believe this
change is probably good ;)

I do not understand the point of cpu_function_call() though, it looks
equal to smp_call_function_single() ?

> -static void __perf_install_in_context(void *info)
> +static int __perf_install_in_context(void *info)
> {
> struct perf_event *event = info;
> struct perf_event_context *ctx = event->ctx;
> @@ -942,20 +1015,15 @@ static void __perf_install_in_context(void *info)
> int err;
>
> /*
> - * If this is a task context, we need to check whether it is
> - * the current task context of this cpu. If not it has been
> - * scheduled out before the smp call arrived.
> - * Or possibly this is the right context but it isn't
> - * on this cpu because it had no events.
> + * In case we're installing a new context to an already running task,
> + * could also happen before perf_event_task_sched_in() on architectures
> + * which do context switches with IRQs enabled.
> */
> - if (ctx->task && cpuctx->task_ctx != ctx) {
> - if (cpuctx->task_ctx || ctx->task != current)
> - return;
> - cpuctx->task_ctx = ctx;
> - }
> + if (ctx->task && !cpuctx->task_ctx)
> + perf_event_context_sched_in(ctx);

OK... This eliminates the 2nd race with __ARCH_WANT_INTERRUPTS_ON_CTXSW
(we must not set "cpuctx->task_ctx = ctx" in case "next" is going to
do perf_event_context_sched_in() later). So it is enough to check
rq->curr in remote_function().

> raw_spin_lock(&ctx->lock);
> - ctx->is_active = 1;
> + WARN_ON_ONCE(!ctx->is_active);

This looks wrong if ctx->task == NULL.

So. With this patch it is possible that perf_event_context_sched_in()
is called right after prepare_lock_switch(). Stupid question, why
can't we always do this then? I mean, what if we change
prepare_task_switch() to do

perf_event_task_sched_out();
perf_event_task_sched_in();

?
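IOW, something like this (a sketch only; sched_in would take "next" here
instead of "current"):

static inline void
prepare_task_switch(struct rq *rq, struct task_struct *prev,
		    struct task_struct *next)
{
	sched_info_switch(prev, next);
	perf_event_task_sched_out(prev, next);
	perf_event_task_sched_in(next);
	fire_sched_out_preempt_notifiers(prev, next);
	prepare_lock_switch(rq, next);
	prepare_arch_switch(next);
	trace_sched_switch(prev, next);
}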

Probably we can unify the COND_STMT(perf_task_events) check and simplify
the things further.

Oleg.

Peter Zijlstra

unread,
Jan 31, 2011, 1:30:02 PM1/31/11
to
On Mon, 2011-01-31 at 18:26 +0100, Oleg Nesterov wrote:
> On 01/28, Peter Zijlstra wrote:
> >
> > Just to give you more food for thought, I couldn't help myself..
>
> Hmm. So far I am only trying to understand the perf_install_in_context()
> paths. And, after I spent almost 2 hours, I am starting to believe this
> change is probably good ;)

phew ;-)

> I do not understand the point of cpu_function_call() though, it looks
> equal to smp_call_function_single() ?

Very nearly so, except it takes a function that returns an int..

Right, but since I moved those functions into perf_event.c (they were
getting rather specific) I can no longer deref (or even obtain) a rq
structure. So it implements rq->curr == p in a somewhat round-about
fashion but it should be identical.

>
> > raw_spin_lock(&ctx->lock);
> > - ctx->is_active = 1;
> > + WARN_ON_ONCE(!ctx->is_active);
>
> This looks wrong if ctx->task == NULL.

cpuctx->ctx should still have ->is_active = 1 I think.

>
> So. With this patch it is possible that perf_event_context_sched_in()
> is called right after prepare_lock_switch(). Stupid question, why
> can't we always do this then? I mean, what if we change
> prepare_task_switch() to do
>
> perf_event_task_sched_out();
> perf_event_task_sched_in();
>
> ?
>
> Probably we can unify the COND_STMT(perf_task_events) check and simplify
> the things further.

That might work, Ingo any reason we have a pre and post hook around the
context switch and not a single function?

Oleg Nesterov

unread,
Jan 31, 2011, 2:20:03 PM1/31/11
to
On 01/31, Peter Zijlstra wrote:
>
> On Mon, 2011-01-31 at 18:26 +0100, Oleg Nesterov wrote:
>
> > I do not understand the point of cpu_function_call() though, it looks
> > equal to smp_call_function_single() ?
>
> Very nearly so, except it takes a function that returns an int..

Ah, indeed...

> > > raw_spin_lock(&ctx->lock);
> > > - ctx->is_active = 1;
> > > + WARN_ON_ONCE(!ctx->is_active);
> >
> > This looks wrong if ctx->task == NULL.
>
> cpuctx->ctx should still have ->is_active = 1 I think.

Unless this is the first cpu counter, no?

Oleg.

Peter Zijlstra

unread,
Jan 31, 2011, 2:30:02 PM1/31/11
to
On Mon, 2011-01-31 at 20:11 +0100, Oleg Nesterov wrote:
>
> > > > raw_spin_lock(&ctx->lock);
> > > > - ctx->is_active = 1;
> > > > + WARN_ON_ONCE(!ctx->is_active);
> > >
> > > This looks wrong if ctx->task == NULL.
> >
> > cpuctx->ctx should still have ->is_active = 1 I think.
>
> Unless this is the first cpu counter, no?

Ah, indeed..

Peter Zijlstra

unread,
Feb 1, 2011, 9:10:02 AM2/1/11
to

Oleg, I've actually run-tested the below and all seems well (clearly
I've never actually hit the races found before either, so in that
respect its not a conclusive test).

Can you agree with this patch?

---
Oleg reported that on architectures with
__ARCH_WANT_INTERRUPTS_ON_CTXSW the IPI from task_oncpu_function_call()
can land before perf_event_task_sched_in() and cause interesting
situations for eg. perf_install_in_context().

This patch reworks the task_oncpu_function_call() interface to give a
more usable primitive as well as rework all its users to hopefully be
more obvious as well as remove the races.

Reported-by: Oleg Nesterov <ol...@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zi...@chello.nl>
---
include/linux/sched.h | 7 --
kernel/perf_event.c | 240 +++++++++++++++++++++++++++++++------------------
kernel/sched.c | 29 +-----
3 files changed, 157 insertions(+), 119 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3294f60..57ad0f9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2575,13 +2575,6 @@ static inline void inc_syscw(struct task_struct *tsk)


#define TASK_SIZE_OF(tsk) TASK_SIZE
#endif

-/*
- * Call the function if the target task is executing on a CPU right now:
- */
-extern void task_oncpu_function_call(struct task_struct *p,
- void (*func) (void *info), void *info);
-
-
#ifdef CONFIG_MM_OWNER
extern void mm_update_next_owner(struct mm_struct *mm);
extern void mm_init_owner(struct mm_struct *mm, struct task_struct *p);
diff --git a/kernel/perf_event.c b/kernel/perf_event.c

index 852ae8c..b81f31f 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -38,6 +38,79 @@

+ if (task_curr(p))
+ smp_call_function_single(task_cpu(p), remote_function, &data, 1);

@@ -618,35 +691,24 @@ __get_cpu_context(struct perf_event_context *ctx)


* We disable the event on the hardware level first. After that we
* remove it from the context list.
*/
-static void __perf_event_remove_from_context(void *info)

+static int __perf_remove_from_context(void *info)


{
struct perf_event *event = info;
struct perf_event_context *ctx = event->ctx;

struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);

- /*


- * If this is a task context, we need to check whether it is
- * the current task context of this cpu. If not it has been
- * scheduled out before the smp call arrived.

- */


- if (ctx->task && cpuctx->task_ctx != ctx)

- return;
-
raw_spin_lock(&ctx->lock);
-
event_sched_out(event, cpuctx, ctx);
-
list_del_event(event, ctx);
-
raw_spin_unlock(&ctx->lock);
+
+ return 0;
}


/*
* Remove the event from a task's (or a CPU's) list of events.
*
- * Must be called with ctx->mutex held.
- *
* CPU events are removed with a smp call. For task events we only
* call when the task is on a CPU.
*
@@ -657,49 +719,48 @@ static void __perf_event_remove_from_context(void *info)


* When called from perf_event_exit_task, it's OK because the
* context has been detached from its task.
*/
-static void perf_event_remove_from_context(struct perf_event *event)
+static void perf_remove_from_context(struct perf_event *event)
{

struct perf_event_context *ctx = event->ctx;

struct task_struct *task = ctx->task;

+ lockdep_assert_held(&ctx->mutex);
+
if (!task) {
/*

+static int __perf_event_disable(void *info)


{
struct perf_event *event = info;
struct perf_event_context *ctx = event->ctx;

@@ -708,9 +769,12 @@ static void __perf_event_disable(void *info)
/*


* If this is a per-task event, need to check whether this

* event's task is the current task on this cpu.

+ *
+ * Can trigger due to concurrent perf_event_context_sched_out()
+ * flipping contexts around.
*/


if (ctx->task && cpuctx->task_ctx != ctx)

- return;
+ return -EINVAL;

raw_spin_lock(&ctx->lock);

@@ -729,6 +793,8 @@ static void __perf_event_disable(void *info)


}

raw_spin_unlock(&ctx->lock);
+
+ return 0;
}

/*

@@ -753,13 +819,13 @@ void perf_event_disable(struct perf_event *event)


/*
* Disable the event on the cpu that it's on
*/
- smp_call_function_single(event->cpu, __perf_event_disable,
- event, 1);
+ cpu_function_call(event->cpu, __perf_event_disable, event);
return;
}

retry:
- task_oncpu_function_call(task, __perf_event_disable, event);
+ if (!task_function_call(task, __perf_event_disable, event))
+ return;

raw_spin_lock_irq(&ctx->lock);
/*

@@ -767,6 +833,11 @@ retry:


*/
if (event->state == PERF_EVENT_STATE_ACTIVE) {
raw_spin_unlock_irq(&ctx->lock);
+ /*
+ * Reload the task pointer, it might have been changed by
+ * a concurrent perf_event_context_sched_out().
+ */
+ task = ctx->task;
goto retry;
}

@@ -778,7 +849,6 @@ retry:


update_group_times(event);
event->state = PERF_EVENT_STATE_OFF;
}
-
raw_spin_unlock_irq(&ctx->lock);
}

@@ -928,12 +998,14 @@ static void add_event_to_ctx(struct perf_event *event,


event->tstamp_stopped = tstamp;
}

+static void perf_event_context_sched_in(struct perf_event_context *ctx);
+
/*
* Cross CPU call to install and enable a performance event
*
* Must be called with ctx->mutex held
*/

-static void __perf_install_in_context(void *info)
+static int __perf_install_in_context(void *info)
{
struct perf_event *event = info;
struct perf_event_context *ctx = event->ctx;

@@ -942,17 +1014,12 @@ static void __perf_install_in_context(void *info)


int err;

/*
- * If this is a task context, we need to check whether it is
- * the current task context of this cpu. If not it has been
- * scheduled out before the smp call arrived.
- * Or possibly this is the right context but it isn't
- * on this cpu because it had no events.
+ * In case we're installing a new context to an already running task,
+ * could also happen before perf_event_task_sched_in() on architectures
+ * which do context switches with IRQs enabled.
*/
- if (ctx->task && cpuctx->task_ctx != ctx) {
- if (cpuctx->task_ctx || ctx->task != current)
- return;
- cpuctx->task_ctx = ctx;
- }
+ if (ctx->task && !cpuctx->task_ctx)
+ perf_event_context_sched_in(ctx);

raw_spin_lock(&ctx->lock);
ctx->is_active = 1;
@@ -997,6 +1064,8 @@ static void __perf_install_in_context(void *info)



unlock:
raw_spin_unlock(&ctx->lock);
+
+ return 0;
}

/*

@@ -1008,8 +1077,6 @@ unlock:
* If the event is attached to a task which is on a CPU we use a smp
* call to enable it in the task context. The task might have been
* scheduled away, but we check this in the smp call again.
- *
- * Must be called with ctx->mutex held.
*/


static void
perf_install_in_context(struct perf_event_context *ctx,

@@ -1018,6 +1085,8 @@ perf_install_in_context(struct perf_event_context *ctx,
{


struct task_struct *task = ctx->task;

+ lockdep_assert_held(&ctx->mutex);
+


event->ctx = ctx;

if (!task) {

@@ -1025,31 +1094,29 @@ perf_install_in_context(struct perf_event_context *ctx,

@@ -1078,7 +1145,7 @@ static void __perf_event_mark_enabled(struct perf_event *event,


/*
* Cross CPU call to enable a performance event
*/
-static void __perf_event_enable(void *info)

+static int __perf_event_enable(void *info)


{
struct perf_event *event = info;
struct perf_event_context *ctx = event->ctx;

@@ -1086,18 +1153,10 @@ static void __perf_event_enable(void *info)


struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
int err;

- /*
- * If this is a per-task event, need to check whether this
- * event's task is the current task on this cpu.

- */


- if (ctx->task && cpuctx->task_ctx != ctx) {
- if (cpuctx->task_ctx || ctx->task != current)
- return;
- cpuctx->task_ctx = ctx;
- }

+ if (WARN_ON_ONCE(!ctx->is_active))
+ return -EINVAL;

raw_spin_lock(&ctx->lock);
- ctx->is_active = 1;

update_context_time(ctx);

if (event->state >= PERF_EVENT_STATE_INACTIVE)

@@ -1138,6 +1197,8 @@ static void __perf_event_enable(void *info)



unlock:
raw_spin_unlock(&ctx->lock);
+
+ return 0;
}

/*

@@ -1158,8 +1219,7 @@ void perf_event_enable(struct perf_event *event)


/*
* Enable the event on the cpu that it's on
*/
- smp_call_function_single(event->cpu, __perf_event_enable,
- event, 1);
+ cpu_function_call(event->cpu, __perf_event_enable, event);
return;
}

@@ -1178,8 +1238,15 @@ void perf_event_enable(struct perf_event *event)


event->state = PERF_EVENT_STATE_OFF;

retry:
+ if (!ctx->is_active) {
+ __perf_event_mark_enabled(event, ctx);
+ goto out;
+ }
+
raw_spin_unlock_irq(&ctx->lock);
- task_oncpu_function_call(task, __perf_event_enable, event);
+
+ if (!task_function_call(task, __perf_event_enable, event))
+ return;

raw_spin_lock_irq(&ctx->lock);

@@ -1187,15 +1254,14 @@ retry:


* If the context is active and the event is still off,
* we need to retry the cross-call.
*/
- if (ctx->is_active && event->state == PERF_EVENT_STATE_OFF)
+ if (ctx->is_active && event->state == PERF_EVENT_STATE_OFF) {
+ /*
+ * task could have been flipped by a concurrent
+ * perf_event_context_sched_out()
+ */
+ task = ctx->task;
goto retry;
-
- /*
- * Since we have the lock this context can't be scheduled
- * in, so we can change the state safely.
- */
- if (event->state == PERF_EVENT_STATE_OFF)
- __perf_event_mark_enabled(event, ctx);
+ }

out:
raw_spin_unlock_irq(&ctx->lock);

@@ -1339,8 +1405,8 @@ static void perf_event_sync_stat(struct perf_event_context *ctx,


}
}

-void perf_event_context_sched_out(struct task_struct *task, int ctxn,
- struct task_struct *next)
+static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
+ struct task_struct *next)
{
struct perf_event_context *ctx = task->perf_event_ctxp[ctxn];
struct perf_event_context *next_ctx;

@@ -1541,7 +1607,7 @@ static void task_ctx_sched_in(struct perf_event_context *ctx,


cpuctx->task_ctx = ctx;
}

-void perf_event_context_sched_in(struct perf_event_context *ctx)
+static void perf_event_context_sched_in(struct perf_event_context *ctx)
{
struct perf_cpu_context *cpuctx;

@@ -5949,10 +6015,10 @@ SYSCALL_DEFINE5(perf_event_open,


struct perf_event_context *gctx = group_leader->ctx;

mutex_lock(&gctx->mutex);
- perf_event_remove_from_context(group_leader);
+ perf_remove_from_context(group_leader);
list_for_each_entry(sibling, &group_leader->sibling_list,
group_entry) {
- perf_event_remove_from_context(sibling);
+ perf_remove_from_context(sibling);
put_ctx(gctx);
}
mutex_unlock(&gctx->mutex);

@@ -6103,7 +6169,7 @@ __perf_event_exit_task(struct perf_event *child_event,


{
struct perf_event *parent_event;

- perf_event_remove_from_context(child_event);
+ perf_remove_from_context(child_event);

parent_event = child_event->parent;
/*

@@ -6594,9 +6660,9 @@ static void __perf_event_exit_context(void *__info)


perf_pmu_rotate_stop(ctx->pmu);

list_for_each_entry_safe(event, tmp, &ctx->pinned_groups, group_entry)
- __perf_event_remove_from_context(event);
+ __perf_remove_from_context(event);
list_for_each_entry_safe(event, tmp, &ctx->flexible_groups, group_entry)
- __perf_event_remove_from_context(event);
+ __perf_remove_from_context(event);
}

static void perf_event_exit_cpu_context(int cpu)
diff --git a/kernel/sched.c b/kernel/sched.c

index 477e1bc..6daa737 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2297,27 +2297,6 @@ void kick_process(struct task_struct *p)

@@ -2809,9 +2788,12 @@ static inline void


prepare_task_switch(struct rq *rq, struct task_struct *prev,
struct task_struct *next)
{
+ sched_info_switch(prev, next);
+ perf_event_task_sched_out(prev, next);
fire_sched_out_preempt_notifiers(prev, next);
prepare_lock_switch(rq, next);
prepare_arch_switch(next);
+ trace_sched_switch(prev, next);
}

/**

@@ -2944,7 +2926,7 @@ context_switch(struct rq *rq, struct task_struct *prev,


struct mm_struct *mm, *oldmm;

prepare_task_switch(rq, prev, next);
- trace_sched_switch(prev, next);
+
mm = next->mm;
oldmm = prev->active_mm;
/*

@@ -4116,9 +4098,6 @@ need_resched_nonpreemptible:


rq->skip_clock_update = 0;

if (likely(prev != next)) {
- sched_info_switch(prev, next);
- perf_event_task_sched_out(prev, next);
-
rq->nr_switches++;
rq->curr = next;
++*switch_count;

Oleg Nesterov

unread,
Feb 1, 2011, 12:40:02 PM2/1/11
to
On 02/01, Peter Zijlstra wrote:
>
> Oleg, I've actually run-tested the below and all seems well (clearly
> I've never actually hit the races found before either, so in that
> respect its not a conclusive test).
>
> Can you agree with this patch?

You know, I already wrote the i-think-it-is-correct email. But then
I decided to read it once again.

> -static void __perf_event_remove_from_context(void *info)
> +static int __perf_remove_from_context(void *info)
> {
> struct perf_event *event = info;
> struct perf_event_context *ctx = event->ctx;
> struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
>
> - /*
> - * If this is a task context, we need to check whether it is
> - * the current task context of this cpu. If not it has been
> - * scheduled out before the smp call arrived.
> - */
> - if (ctx->task && cpuctx->task_ctx != ctx)
> - return;

OK, I think this is right... event_sched_out() will see
PERF_EVENT_STATE_INACTIVE if perf_event_task_sched_in() was not
called yet.
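(Because of the check at the top of event_sched_out(), roughly

	if (event->state != PERF_EVENT_STATE_ACTIVE)
		return;

the removal degenerates to list_del_event() in this case - assuming I
read event_sched_out() right.)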

But,

> -static void perf_event_remove_from_context(struct perf_event *event)
> +static void perf_remove_from_context(struct perf_event *event)
> {

> ...


> raw_spin_lock_irq(&ctx->lock);
> /*
> - * If the context is active we need to retry the smp call.
> + * If we failed to find a running task, but find it running now that
> + * we've acquired the ctx->lock, retry.
> */
> - if (ctx->nr_active && !list_empty(&event->group_entry)) {
> + if (task_curr(task)) {
> raw_spin_unlock_irq(&ctx->lock);
> goto retry;
> }
>
> /*
> - * The lock prevents that this context is scheduled in so we
> - * can remove the event safely, if the call above did not
> - * succeed.
> + * Since the task isn't running, its safe to remove the event, us
> + * holding the ctx->lock ensures the task won't get scheduled in.
> */
> - if (!list_empty(&event->group_entry))
> - list_del_event(event, ctx);
> + list_del_event(event, ctx);

this looks suspicious (the same for perf_install_in_context).

Unlike the IPI handler, this can see schedule-in-progress in any state.
In particular, we can see rq->curr == next (so that task_curr() == F),
but before "prev" has already called perf_event_task_sched_out().

So we have to check ctx->is_active, or schedule() should change rq->curr
after perf_event_task_sched_out().
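IOW, something along these lines in perf_remove_from_context() -- only a
sketch, not a tested patch:

	raw_spin_lock_irq(&ctx->lock);
	/*
	 * ctx->is_active is cleared (under ctx->lock) only once
	 * perf_event_task_sched_out() has run for "prev", so this also
	 * covers the window where rq->curr already points to "next" but
	 * the outgoing task's context is still live on its CPU.
	 */
	if (ctx->is_active) {
		raw_spin_unlock_irq(&ctx->lock);
		goto retry;
	}
	list_del_event(event, ctx);
	raw_spin_unlock_irq(&ctx->lock);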

> @@ -753,13 +819,13 @@ void perf_event_disable(struct perf_event *event)

> ...


> */
> if (event->state == PERF_EVENT_STATE_ACTIVE) {
> raw_spin_unlock_irq(&ctx->lock);
> + /*
> + * Reload the task pointer, it might have been changed by
> + * a concurrent perf_event_context_sched_out().
> + */
> + task = ctx->task;
> goto retry;

I am wondering why only perf_event_disable() needs this...

Just curious, is this equally needed without this patch?

Oleg.

Peter Zijlstra

Feb 1, 2011, 1:10:01 PM
On Tue, 2011-02-01 at 18:27 +0100, Oleg Nesterov wrote:
> On 02/01, Peter Zijlstra wrote:
> >
> > Oleg, I've actually run-tested the below and all seems well (clearly
> > I've never actually hit the races found before either, so in that
> > respect its not a conclusive test).
> >
> > Can you agree with this patch?
>
> You know, I already wrote the i-think-it-is-correct email. But then
> I decided to read it once again.

:-)

> > -static void __perf_event_remove_from_context(void *info)
> > +static int __perf_remove_from_context(void *info)
> > {
> > struct perf_event *event = info;
> > struct perf_event_context *ctx = event->ctx;
> > struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
> >
> > - /*
> > - * If this is a task context, we need to check whether it is
> > - * the current task context of this cpu. If not it has been
> > - * scheduled out before the smp call arrived.
> > - */
> > - if (ctx->task && cpuctx->task_ctx != ctx)
> > - return;
>
> OK, I think this is right... event_sched_out() will see
> PERF_EVENT_STATE_INACTIVE if perf_event_task_sched_in() was not
> called yet.

Right

I only considered current == next in that case, not current == prev, let
me undo some of those sched.c bits and put a comment.

> > @@ -753,13 +819,13 @@ void perf_event_disable(struct perf_event *event)
> > ...
> > */
> > if (event->state == PERF_EVENT_STATE_ACTIVE) {
> > raw_spin_unlock_irq(&ctx->lock);
> > + /*
> > + * Reload the task pointer, it might have been changed by
> > + * a concurrent perf_event_context_sched_out().
> > + */
> > + task = ctx->task;
> > goto retry;
>
> I am wondering why only perf_event_disable() needs this...

perf_event_enable() also has it, but you made me re-assess all that and I
think there's more to it.

perf_install_in_context() works on a ctx obtained by find_get_context();
that context is either new (and thus uncloned) or existing, in which case
find_get_context() called unclone_ctx() on it. So I was thinking there was
no race with the ctx flipping in perf_event_context_sched_out(); _however_,
since it only acquires ctx->mutex after calling unclone_ctx(), there is a
race window with perf_event_init_task().

This race we should fix with perf_pin_task_context()
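For reference, perf_pin_task_context() is basically perf_lock_task_context()
plus a pin; roughly (a sketch, not an exact quote of the source):

	static struct perf_event_context *
	perf_pin_task_context(struct task_struct *task, int ctxn)
	{
		struct perf_event_context *ctx;
		unsigned long flags;

		/*
		 * perf_lock_task_context() returns with a reference held
		 * and ctx->lock taken; raising pin_count is what keeps the
		 * context from being swapped underneath us by the clone
		 * optimization on the sched_out path.
		 */
		ctx = perf_lock_task_context(task, ctxn, &flags);
		if (ctx) {
			++ctx->pin_count;
			raw_spin_unlock_irqrestore(&ctx->lock, flags);
		}
		return ctx;
	}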

perf_remove_from_context() seems OK though: we only call it on the install
ctx, which we should fix as above, or on a dying uncloned context (no race
with fork because exit doesn't clone).

> Just curious, is this equally needed without this patch?

Yes, I think it is a pre-existing problem.

Peter Zijlstra

Feb 1, 2011, 1:20:01 PM
On Tue, 2011-02-01 at 19:08 +0100, Peter Zijlstra wrote:
> > > +static void perf_remove_from_context(struct perf_event *event)
> > > {
> > > ...
> > > raw_spin_lock_irq(&ctx->lock);
> > > /*
> > > + * If we failed to find a running task, but find it running now that
> > > + * we've acquired the ctx->lock, retry.
> > > */
> > > + if (task_curr(task)) {
> > > raw_spin_unlock_irq(&ctx->lock);
> > > goto retry;
> > > }
> > >
> > > /*
> > > + * Since the task isn't running, its safe to remove the event, us
> > > + * holding the ctx->lock ensures the task won't get scheduled in.
> > > */
> > > + list_del_event(event, ctx);
> >
> > this looks suspicious (the same for perf_install_in_context).
> >
> > Unlike the IPI handler, this can see schedule-in-progress in any state.
> > In particular, we can see rq->curr == next (so that task_curr() == F),
> > but before "prev" has called perf_event_task_sched_out().
> >
> > So we have to check ctx->is_active, or schedule() should change rq->curr
> > after perf_event_task_sched_out().
>
> I only considered current == next in that case, not current == prev, let
> me undo some of those sched.c bits and put a comment.

On second thought, your proposed ->is_active check seems to result in
much nicer code in sched.c. Let me think through that.

Peter Zijlstra

Feb 1, 2011, 4:00:01 PM
On Tue, 2011-02-01 at 19:08 +0100, Peter Zijlstra wrote:
> perf_install_in_context() works on a ctx obtained by find_get_context(),
> that context is either new (uncloned) or existing in which case it
> called unclone_ctx(). So I was thinking there was no race with the ctx
> flipping in perf_event_context_sched_out(), _however_ since it only
> acquires ctx->mutex after calling unclone_ctx() there is a race window
> with perf_event_init_task().
>
> This race we should fix with perf_pin_task_context()

I came up with the below.. I'll give it some runtime tomorrow, my brain
just gave up for today..

---
Index: linux-2.6/kernel/perf_event.c
===================================================================
--- linux-2.6.orig/kernel/perf_event.c
+++ linux-2.6/kernel/perf_event.c
@@ -327,7 +327,6 @@ static void perf_unpin_context(struct pe
raw_spin_lock_irqsave(&ctx->lock, flags);
--ctx->pin_count;
raw_spin_unlock_irqrestore(&ctx->lock, flags);
- put_ctx(ctx);
}

/*
@@ -741,10 +740,10 @@ static void perf_remove_from_context(str

raw_spin_lock_irq(&ctx->lock);
/*
- * If we failed to find a running task, but find it running now that
- * we've acquired the ctx->lock, retry.
+ * If we failed to find a running task, but find the context active now
+ * that we've acquired the ctx->lock, retry.
*/
- if (task_curr(task)) {
+ if (ctx->is_active) {
raw_spin_unlock_irq(&ctx->lock);
goto retry;
}
@@ -1104,10 +1103,10 @@ perf_install_in_context(struct perf_even

raw_spin_lock_irq(&ctx->lock);
/*
- * If we failed to find a running task, but find it running now that
- * we've acquired the ctx->lock, retry.
+ * If we failed to find a running task, but find the context active now
+ * that we've acquired the ctx->lock, retry.
*/
- if (task_curr(task)) {
+ if (ctx->is_active) {
raw_spin_unlock_irq(&ctx->lock);
goto retry;
}
@@ -2278,6 +2277,9 @@ find_lively_task_by_vpid(pid_t vpid)

}

+/*
+ * Returns a matching context with refcount and pincount.
+ */
static struct perf_event_context *
find_get_context(struct pmu *pmu, struct task_struct *task, int cpu)
{
@@ -2302,6 +2304,7 @@ find_get_context(struct pmu *pmu, struct
cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
ctx = &cpuctx->ctx;
get_ctx(ctx);
+ ++ctx->pin_count;

return ctx;
}
@@ -2315,6 +2318,7 @@ find_get_context(struct pmu *pmu, struct
ctx = perf_lock_task_context(task, ctxn, &flags);
if (ctx) {
unclone_ctx(ctx);
+ ++ctx->pin_count;
raw_spin_unlock_irqrestore(&ctx->lock, flags);
}

@@ -6041,6 +6045,7 @@ SYSCALL_DEFINE5(perf_event_open,

perf_install_in_context(ctx, event, cpu);
++ctx->generation;
+ perf_unpin_context(ctx);
mutex_unlock(&ctx->mutex);

event->owner = current;
@@ -6066,6 +6071,7 @@ SYSCALL_DEFINE5(perf_event_open,
return event_fd;

err_context:
+ perf_unpin_context(ctx);
put_ctx(ctx);
err_alloc:
free_event(event);
@@ -6116,6 +6122,7 @@ perf_event_create_kernel_counter(struct
mutex_lock(&ctx->mutex);
perf_install_in_context(ctx, event, cpu);
++ctx->generation;
+ perf_unpin_context(ctx);
mutex_unlock(&ctx->mutex);

return event;
@@ -6591,6 +6598,7 @@ int perf_event_init_context(struct task_
mutex_unlock(&parent_ctx->mutex);

perf_unpin_context(parent_ctx);
+ put_ctx(parent_ctx);

return ret;
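For completeness, the reason the pin helps here: context_equiv(), which
gates the context-swap optimization in perf_event_context_sched_out(),
refuses to treat a pinned context as a clone, so ctx->task cannot be
flipped while find_get_context()'s caller still holds the pin. Roughly
(a sketch, not an exact quote):

	static int
	context_equiv(struct perf_event_context *ctx1,
		      struct perf_event_context *ctx2)
	{
		return ctx1->parent_ctx && ctx1->parent_ctx == ctx2->parent_ctx
			&& ctx1->parent_gen == ctx2->parent_gen
			&& !ctx1->pin_count && !ctx2->pin_count;
	}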
