Q: perf_event && task->ptrace

Oleg Nesterov

unread,

Nov 8, 2010, 10:10:02 AM11/8/10

to

Hello.

I am trying to understand the usage of hw-breakpoints in arch_ptrace().
ptrace_set_debugreg() and related code looks obviously racy. Nothing
protects us against flush_ptrace_hw_breakpoint() called by the dying
tracee. Afaics we can leak perf_event or use the already freed memory
or both.

Am I missed something?

Looking into the git history, I don't even know which patch should be
blamed (if I am right), there were too many changes. I noticed that
2ebd4ffb6d0cb877787b1e42be8485820158857e "perf events: Split out task
search into helper" moved the PF_EXITING check from find_get_context().
This check coould help if sys_ptrace() races with SIGKILL, but it was
racy anyway.

It is not clear to me what should be done. Looking more, I do not
understand the scope of perf_event/ctx at all, sys_perf_event_open()
looks wrong too, see the next email I am going to send.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Oleg Nesterov

unread,

Nov 8, 2010, 10:10:02 AM11/8/10

to

Another thing I can't understand, event->owner/owner_entry.

Say, some thread calls sys_perf_event_open() and creates the event.
It becomes its owner. Now this thread exits, but fd/event are still
here, and event->owner refers to the dead task_struct.

ptrace looks even more strange. Debugger can attach the breakpoint
to the tracee and then exit/detach. ->ptrace_bps events still point
to the same (may be dead) task. Even if another debugger attaches
and reuses these events.

And for what? Afaics, this is only used by PR_TASK_PERF_EVENTS_xxABLE.
Looks like, tools/perf/ used prctl() in the past. Perhaps this API
can die now and we can kill ->owner/owner_entry?

Frederic Weisbecker

unread,

Nov 8, 2010, 1:50:02 PM11/8/10

to

On Mon, Nov 08, 2010 at 03:56:47PM +0100, Oleg Nesterov wrote:
> Hello.
>
> I am trying to understand the usage of hw-breakpoints in arch_ptrace().
> ptrace_set_debugreg() and related code looks obviously racy. Nothing
> protects us against flush_ptrace_hw_breakpoint() called by the dying
> tracee. Afaics we can leak perf_event or use the already freed memory
> or both.
>
> Am I missed something?
>
> Looking into the git history, I don't even know which patch should be
> blamed (if I am right), there were too many changes. I noticed that
> 2ebd4ffb6d0cb877787b1e42be8485820158857e "perf events: Split out task
> search into helper" moved the PF_EXITING check from find_get_context().
> This check coould help if sys_ptrace() races with SIGKILL, but it was
> racy anyway.
>
> It is not clear to me what should be done. Looking more, I do not
> understand the scope of perf_event/ctx at all, sys_perf_event_open()
> looks wrong too, see the next email I am going to send.
>
> Oleg.
>

But I don't understand how ptrace_set_debugreg() and flush_old_exec() can
happen at the same time. The parent can only do the ptrace request when
the child is stopped, right? But it can't be stopped in flush_old_exec()...?

Not sure how any race can happen here. I am certainly missing something obvious.

Thanks.

Oleg Nesterov

unread,

Nov 8, 2010, 2:30:01 PM11/8/10

to

On 11/08, Frederic Weisbecker wrote:
>
> On Mon, Nov 08, 2010 at 03:56:47PM +0100, Oleg Nesterov wrote:
> > Hello.
> >
> > I am trying to understand the usage of hw-breakpoints in arch_ptrace().
> > ptrace_set_debugreg() and related code looks obviously racy. Nothing
> > protects us against flush_ptrace_hw_breakpoint() called by the dying
> > tracee. Afaics we can leak perf_event or use the already freed memory
> > or both.
> >
> > Am I missed something?
> >
> > Looking into the git history, I don't even know which patch should be
> > blamed (if I am right), there were too many changes. I noticed that
> > 2ebd4ffb6d0cb877787b1e42be8485820158857e "perf events: Split out task
> > search into helper" moved the PF_EXITING check from find_get_context().
> > This check coould help if sys_ptrace() races with SIGKILL, but it was
> > racy anyway.
> >
> > It is not clear to me what should be done. Looking more, I do not
> > understand the scope of perf_event/ctx at all, sys_perf_event_open()
> > looks wrong too, see the next email I am going to send.
> >
> > Oleg.
>
> But I don't understand how ptrace_set_debugreg() and flush_old_exec() can
> happen at the same time.

This can't happen. But I meant do_exit()->flush_ptrace_hw_breakpoint()

> The parent can only do the ptrace request when
> the child is stopped, right?

Yes. But nothing can "pin" TASK_TRACED.

We know that a) the tracee was stopped() when sys_ptrace() was called
and b) its task_struct can't go away. That is all. The tracee can be
killed at any moment, and sys_ptrace() can race with with
flush_ptrace_hw_breakpoint().

> I am certainly missing something obvious.

Perhaps ;) Or, it is quite possible I missed something, I never read
this code before and it is certainly not trivial.

Oleg.

Frederic Weisbecker

unread,

Nov 8, 2010, 3:20:02 PM11/8/10

to

On Mon, Nov 08, 2010 at 03:57:54PM +0100, Oleg Nesterov wrote:
> Another thing I can't understand, event->owner/owner_entry.
>
> Say, some thread calls sys_perf_event_open() and creates the event.
> It becomes its owner. Now this thread exits, but fd/event are still
> here, and event->owner refers to the dead task_struct.

Hmm, it seems to me that the last reference to the events are
put in __perf_event_exit_task, and then free_event() is called
there, which rcu queues the event to be released.

Not sure where is the issue here.

> ptrace looks even more strange. Debugger can attach the breakpoint
> to the tracee and then exit/detach. ->ptrace_bps events still point
> to the same (may be dead) task. Even if another debugger attaches
> and reuses these events.

Hmm, in this case ptrace_bps will continue to trigger on the task
they were applied.

On the other hand, you're right, I'm not sure that the debugger is
the correct owner for the breakpoints.
I think it works though, looking at perf_event_create_kernel_counter():

event->owner = current;
get_task_struct(current);

(current is the debugger)

On perf_event_release_kernel():

put_task_struct(event->owner);

So even if the debugger dies, we keep a valid owner, it works but just makes
few sense as the debugger can change.
Perhaps the real owner should be the task on which we attach our breakpoint.

What do you think?

Peter Zijlstra

unread,

Nov 8, 2010, 3:50:01 PM11/8/10

to

On Mon, 2010-11-08 at 21:11 +0100, Frederic Weisbecker wrote:
> Perhaps the real owner should be the task on which we attach our
> breakpoint.

No the point of event->owner is to point to the task that creates the
event, not the task we possibly attach it to (that should be reachable
through event->ctx->task).

As to removing event->owner as Oleg suggests, its a published ABI and
there might be people using it.

The use-case is a monitor thread wanting to stop all monitoring it
initiated, for example because its wants to synchronize various counters
attached to different tasks etc..

Oleg Nesterov

unread,

Nov 9, 2010, 11:10:02 AM11/9/10

to

On 11/08, Frederic Weisbecker wrote:
>

> On Mon, Nov 08, 2010 at 03:57:54PM +0100, Oleg Nesterov wrote:
> > Another thing I can't understand, event->owner/owner_entry.
> >
> > Say, some thread calls sys_perf_event_open() and creates the event.
> > It becomes its owner. Now this thread exits, but fd/event are still
> > here, and event->owner refers to the dead task_struct.
>
> Hmm, it seems to me that the last reference to the events are
> put in __perf_event_exit_task,

I think no, in this case sys_perf_event_open() owns the event. IOW,
perf_release() frees this perf_event. But this doesn't matter.

> and then free_event() is called
> there, which rcu queues the event to be released.
>
> Not sure where is the issue here.

I am not saying this is buggy. But it looks very strange to me.

If the creator of perf_event dies, nobody can use its ->perf_event_list
anyway. What is the point to keep the reference to the dead task_struct
and preserve this ->perf_event_list?

And why do we need ->owner at all? Afaics, it is _only_ needed to find
->perf_event_mutex in perf_event_release_kernel(). And this mutex only
protects ->perf_event_list (mostly for prctl).

Of course, I understand that it is not completely trivial to change this.
The exiting creator can clear its ->perf_event_list and set
event->owner = NULL, but then perf_event_release_kernel() should
avoid the races with do_exit() somehow.

> > ptrace looks even more strange. Debugger can attach the breakpoint
> > to the tracee and then exit/detach. ->ptrace_bps events still point
> > to the same (may be dead) task. Even if another debugger attaches
> > and reuses these events.
>
>
>
> Hmm, in this case ptrace_bps will continue to trigger on the task
> they were applied.
>
> On the other hand, you're right, I'm not sure that the debugger is
> the correct owner for the breakpoints.
> I think it works though, looking at perf_event_create_kernel_counter():
>
> event->owner = current;
> get_task_struct(current);
>
> (current is the debugger)
>
> On perf_event_release_kernel():
>
> put_task_struct(event->owner);
>
> So even if the debugger dies, we keep a valid owner, it works but just makes
> few sense as the debugger can change.

Yes, it works, but I am not sure about "valid" above ;) Even if the previous
debugger doesn't exit.

And. Suppose that the new debugger attaches and reuses ->ptrace_bps[],
everything works.

Now, the former debugger does prctl(PR_TASK_PERF_EVENTS_DISABLE) and
suddenly bps stop working.

Not to mention this looks racy. Can't prctl() doing perf_event_disable/enable
race with modify_user_hw_breakpoint/unregister_hw_breakpoint/etc ?

> Perhaps the real owner should be the task on which we attach our breakpoint.

Not sure... What for?

In any case, I don't think the tracee should "control" this event.

Oleg.

Oleg Nesterov

unread,

Nov 9, 2010, 11:30:02 AM11/9/10

to

On 11/08, Peter Zijlstra wrote:
>
> As to removing event->owner as Oleg suggests, its a published ABI and
> there might be people using it.

This was my main question.

It's a pity ;) This ABI doesn't look very nice, but OK.

Oleg.

Peter Zijlstra

unread,

Nov 9, 2010, 12:00:03 PM11/9/10

to

On Tue, 2010-11-09 at 16:57 +0100, Oleg Nesterov wrote:
>
> If the creator of perf_event dies, nobody can use its ->perf_event_list
> anyway. What is the point to keep the reference to the dead task_struct
> and preserve this ->perf_event_list?

But when the owner dies it will close all its fds, which means it will
clear its tsk->perf_event_list, no? (With exception of the case where
the fd was passed through a unix-socket to another process).

Oleg Nesterov

unread,

Nov 9, 2010, 12:10:02 PM11/9/10

to

On 11/09, Peter Zijlstra wrote:
>
> On Tue, 2010-11-09 at 16:57 +0100, Oleg Nesterov wrote:
> >
> > If the creator of perf_event dies, nobody can use its ->perf_event_list
> > anyway. What is the point to keep the reference to the dead task_struct
> > and preserve this ->perf_event_list?
>
> But when the owner dies it will close all its fds, which means it will
> clear its tsk->perf_event_list, no? (With exception of the case where
> the fd was passed through a unix-socket to another process).

fork(), pthread_create(). Only __fput() calls ->release, when the last
reference to file goes away.

And ptrace(), it doesn't use sys_perf_event_open() to create the event.

Oleg.

Peter Zijlstra

unread,

Nov 9, 2010, 12:10:03 PM11/9/10

to

On Tue, 2010-11-09 at 17:58 +0100, Oleg Nesterov wrote:
> On 11/09, Peter Zijlstra wrote:
> >
> > On Tue, 2010-11-09 at 16:57 +0100, Oleg Nesterov wrote:
> > >
> > > If the creator of perf_event dies, nobody can use its ->perf_event_list
> > > anyway. What is the point to keep the reference to the dead task_struct
> > > and preserve this ->perf_event_list?
> >
> > But when the owner dies it will close all its fds, which means it will
> > clear its tsk->perf_event_list, no? (With exception of the case where
> > the fd was passed through a unix-socket to another process).
>
> fork(), pthread_create(). Only __fput() calls ->release, when the last
> reference to file goes away.

Ah,.. quite so. So how about we explicitly destroy the list when the
task dies?

> And ptrace(), it doesn't use sys_perf_event_open() to create the event.

Right, I guess it uses kernel based things, I guess we could not add
kernel based counters to the list.

Oleg Nesterov

unread,

Nov 9, 2010, 12:50:02 PM11/9/10

to

On 11/09, Peter Zijlstra wrote:
>
> Ah,.. quite so. So how about we explicitly destroy the list when the
> task dies?

Yes, I think it makes sense to destroy the list and set ->owner = NULL.
If we reset the owner, we can also avoid get_task_struct().

The only problem is perf_event_release_kernel(), it can race with the
exiting event->owner. It can do get_task_struct() under rcu lock temporary,
just to take the mutex and remove the entry.

> > And ptrace(), it doesn't use sys_perf_event_open() to create the event.
>
> Right, I guess it uses kernel based things, I guess we could not add
> kernel based counters to the list.

Agreed, another case when event->owner should be NULL.

Hmm. With or without these changes. Shouldn't perf_event_release_kernel()
remove the event from list before anything else? Otherwise, afaics a thread
which does close(event_fd) can race with creator doing prctl(EVENTS_ENABLE),
no?

Oleg.

Peter Zijlstra

unread,

Nov 9, 2010, 1:10:01 PM11/9/10

to

On Tue, 2010-11-09 at 18:42 +0100, Oleg Nesterov wrote:
> On 11/09, Peter Zijlstra wrote:
> >
> > Ah,.. quite so. So how about we explicitly destroy the list when the
> > task dies?
>
> Yes, I think it makes sense to destroy the list and set ->owner = NULL.
> If we reset the owner, we can also avoid get_task_struct().
>
> The only problem is perf_event_release_kernel(), it can race with the
> exiting event->owner. It can do get_task_struct() under rcu lock temporary,
> just to take the mutex and remove the entry.
>
> > > And ptrace(), it doesn't use sys_perf_event_open() to create the event.
> >
> > Right, I guess it uses kernel based things, I guess we could not add
> > kernel based counters to the list.
>
> Agreed, another case when event->owner should be NULL.
>
>
>
> Hmm. With or without these changes. Shouldn't perf_event_release_kernel()
> remove the event from list before anything else? Otherwise, afaics a thread
> which does close(event_fd) can race with creator doing prctl(EVENTS_ENABLE),
> no?

I think you're right, how about something like this?

---
Index: linux-2.6/kernel/perf_event.c
===================================================================
--- linux-2.6.orig/kernel/perf_event.c
+++ linux-2.6/kernel/perf_event.c
@@ -2234,11 +2234,6 @@ int perf_event_release_kernel(struct per
raw_spin_unlock_irq(&ctx->lock);
mutex_unlock(&ctx->mutex);

- mutex_lock(&event->owner->perf_event_mutex);
- list_del_init(&event->owner_entry);
- mutex_unlock(&event->owner->perf_event_mutex);
- put_task_struct(event->owner);
-
free_event(event);

return 0;
@@ -2254,6 +2249,12 @@ static int perf_release(struct inode *in

file->private_data = NULL;

+ if (event->owner) {
+ mutex_lock(&event->owner->perf_event_mutex);
+ list_del_init(&event->owner_entry);
+ mutex_unlock(&event->owner->perf_event_mutex);
+ }
+
return perf_event_release_kernel(event);
}

@@ -5677,7 +5678,7 @@ SYSCALL_DEFINE5(perf_event_open,
mutex_unlock(&ctx->mutex);

event->owner = current;
- get_task_struct(current);
+
mutex_lock(&current->perf_event_mutex);
list_add_tail(&event->owner_entry, &current->perf_event_list);
mutex_unlock(&current->perf_event_mutex);
@@ -5745,12 +5746,6 @@ perf_event_create_kernel_counter(struct
++ctx->generation;
mutex_unlock(&ctx->mutex);

- event->owner = current;
- get_task_struct(current);
- mutex_lock(&current->perf_event_mutex);
- list_add_tail(&event->owner_entry, &current->perf_event_list);
- mutex_unlock(&current->perf_event_mutex);
-
return event;

err_free:
@@ -5901,8 +5896,16 @@ static void perf_event_exit_task_context
*/
void perf_event_exit_task(struct task_struct *child)
{
+ struct perf_event *event, *tmp;
int ctxn;

+ mutex_lock(&child->perf_event_mutex);
+ list_for_each_entry_safe(event, tmp, &child->perf_event_list,
+ owner_entry) {
+ list_del_init(&event->owner_entry);
+ }
+ mutex_unlock(&child->perf_event_mutex);
+
for_each_task_context_nr(ctxn)
perf_event_exit_task_context(child, ctxn);

Oleg Nesterov

unread,

Nov 9, 2010, 2:10:01 PM11/9/10

to

On 11/09, Peter Zijlstra wrote:
>
> I think you're right, how about something like this?

I need to read it with a fresh head ;)

At first glance,

> @@ -2254,6 +2249,12 @@ static int perf_release(struct inode *in
>
> file->private_data = NULL;
>
> + if (event->owner) {
> + mutex_lock(&event->owner->perf_event_mutex);
> + list_del_init(&event->owner_entry);
> + mutex_unlock(&event->owner->perf_event_mutex);
> + }

Agreed, it is better to do this in perf_release().

But, this can use the already freed task_struct, event->owner.

Either sys_perf_open() should do get_task_struct() like we currently
do, or perf_event_exit_task() should clear event->owner and then
perf_release() should do something like

rcu_read_lock();
owner = event->owner;
if (owner)
get_task_struct(owner);
rcu_read_unlock();

if (owner) {
mutex_lock(&event->owner->perf_event_mutex);
list_del_init(&event->owner_entry);
mutex_unlock(&event->owner->perf_event_mutex);
put_task_struct(owner);
}

Probably this can be simplified...

Oleg.

Peter Zijlstra

unread,

Nov 9, 2010, 2:20:02 PM11/9/10

to

On Tue, 2010-11-09 at 19:57 +0100, Oleg Nesterov wrote:
> On 11/09, Peter Zijlstra wrote:
> >
> > I think you're right, how about something like this?
>
> I need to read it with a fresh head ;)

You're right, and I seem to suffer from a similar problem, will respin
tomorrow.

Peter Zijlstra

unread,

Nov 10, 2010, 10:20:02 AM11/10/10

to

On Tue, 2010-11-09 at 19:57 +0100, Oleg Nesterov wrote:
> Either sys_perf_open() should do get_task_struct() like we currently
> do, or perf_event_exit_task() should clear event->owner and then
> perf_release() should do something like
>
> rcu_read_lock();
> owner = event->owner;
> if (owner)
> get_task_struct(owner);
> rcu_read_unlock();
>
> if (owner) {
> mutex_lock(&event->owner->perf_event_mutex);
> list_del_init(&event->owner_entry);
> mutex_unlock(&event->owner->perf_event_mutex);
> put_task_struct(owner);
> }
>
> Probably this can be simplified...

I think that's still racy, suppose we do:

void perf_event_exit_task(struct task_struct *child)
{

struct perf_event *event, *tmp;
int ctxn;

mutex_lock(&child->perf_event_mutex);
list_for_each_entry_safe(event, tmp, &child->perf_event_list,
owner_entry) {
event->owner = NULL;
list_del_init(&event->owner_entry);
}
mutex_unlock(&child->perf_event_mutex);

for_each_task_context_nr(ctxn)
perf_event_exit_task_context(child, ctxn);
}

and the close() races with an exit, then couldn't we observe
event->owner after the last put_task_struct()? In which case a
get_task_struct() will result in a double-free.

Oleg Nesterov

unread,

Nov 10, 2010, 11:00:02 AM11/10/10

to

I think no. Note that we do not just free task_struct via rcu callback.
Instead, delayed_put_task_struct() drops the (may be) last reference.

But the code is racy, yes. owner != NULL case is fine. But
perf_release() can see event->owner == NULL before list_del() was
completed. perf_event_exit_task() needs wmb() in between, I think.

Oleg.

Peter Zijlstra

unread,

Nov 12, 2010, 10:50:02 AM11/12/10

to

On Wed, 2010-11-10 at 16:44 +0100, Oleg Nesterov wrote:
>
> But the code is racy, yes. owner != NULL case is fine. But
> perf_release() can see event->owner == NULL before list_del() was
> completed. perf_event_exit_task() needs wmb() in between, I think.
>

Utter paranoia took over and I'm still not sure its solid...

---
Subject: perf: Fix owner-list vs exit
From: Peter Zijlstra <a.p.zi...@chello.nl>
Date: Tue, 09 Nov 2010 19:01:43 +0100

Reported-by: Oleg Nesterov <ol...@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zi...@chello.nl>
LKML-Reference: <1289325703.2191.60.camel@laptop>
---
kernel/perf_event.c | 63 ++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 51 insertions(+), 12 deletions(-)

Index: linux-2.6/kernel/perf_event.c
===================================================================
--- linux-2.6.orig/kernel/perf_event.c
+++ linux-2.6/kernel/perf_event.c
@@ -2234,11 +2234,6 @@ int perf_event_release_kernel(struct per
raw_spin_unlock_irq(&ctx->lock);
mutex_unlock(&ctx->mutex);

- mutex_lock(&event->owner->perf_event_mutex);
- list_del_init(&event->owner_entry);
- mutex_unlock(&event->owner->perf_event_mutex);
- put_task_struct(event->owner);
-
free_event(event);

return 0;

@@ -2251,9 +2246,43 @@ EXPORT_SYMBOL_GPL(perf_event_release_ker
static int perf_release(struct inode *inode, struct file *file)
{
struct perf_event *event = file->private_data;
+ struct task_struct *owner;

file->private_data = NULL;

+ rcu_read_lock();
+ owner = ACCESS_ONCE(event->owner);
+ /*
+ * Matches the smp_wmb() in perf_event_exit_task(). If we observe
+ * !owner it means the list deletion is complete and we can indeed
+ * free this event, otherwise we need to serialize on
+ * owner->perf_event_mutex.
+ */
+ smp_read_barrier_depends();
+ if (owner) {
+ /*
+ * Since delayed_put_task_struct() also drops the last
+ * task reference we can safely take a new reference
+ * while holding the rcu_read_lock().
+ */
+ get_task_struct(owner);
+ }
+ rcu_read_unlock();
+
+ if (owner) {
+ mutex_lock(&owner->perf_event_mutex);
+ /*
+ * We have to re-check the event->owner field, if it is cleared
+ * we raced with perf_event_exit_task(), acquiring the mutex
+ * ensured they're done, and we can proceed with freeing the
+ * event.
+ */
+ if (event->owner)
+ list_del_init(&event->owner_entry);
+ mutex_unlock(&owner->perf_event_mutex);
+ put_task_struct(owner);
+ }
+
return perf_event_release_kernel(event);
}

@@ -5677,7 +5706,7 @@ SYSCALL_DEFINE5(perf_event_open,

mutex_unlock(&ctx->mutex);

event->owner = current;
- get_task_struct(current);
+
mutex_lock(&current->perf_event_mutex);
list_add_tail(&event->owner_entry, &current->perf_event_list);
mutex_unlock(&current->perf_event_mutex);

@@ -5745,12 +5774,6 @@ perf_event_create_kernel_counter(struct

++ctx->generation;
mutex_unlock(&ctx->mutex);

- event->owner = current;
- get_task_struct(current);
- mutex_lock(&current->perf_event_mutex);
- list_add_tail(&event->owner_entry, &current->perf_event_list);
- mutex_unlock(&current->perf_event_mutex);
-
return event;

err_free:

@@ -5901,8 +5924,24 @@ static void perf_event_exit_task_context

*/
void perf_event_exit_task(struct task_struct *child)
{

+ struct perf_event *event, *tmp;
int ctxn;

+ mutex_lock(&child->perf_event_mutex);
+ list_for_each_entry_safe(event, tmp, &child->perf_event_list,
+ owner_entry) {
+ list_del_init(&event->owner_entry);
+

+ /*
+ * Ensure the list deletion is visible before we clear
+ * the owner, closes a race against perf_release() where
+ * we need to serialize on the owner->perf_event_mutex.
+ */
+ smp_wmb();
+ event->owner = NULL;

+ }
+ mutex_unlock(&child->perf_event_mutex);
+
for_each_task_context_nr(ctxn)
perf_event_exit_task_context(child, ctxn);
}

--

Oleg Nesterov

unread,

Nov 12, 2010, 2:00:02 PM11/12/10

to

On 11/12, Peter Zijlstra wrote:
>
> Utter paranoia took over and I'm still not sure its solid...

Thanks Peter!

I believe this all is correct.

tip-bot for Peter Zijlstra

unread,

Nov 18, 2010, 9:10:03 AM11/18/10

to

Commit-ID: 8882135bcd332f294df5455747ea43ba9e6f77ad
Gitweb: http://git.kernel.org/tip/8882135bcd332f294df5455747ea43ba9e6f77ad
Author: Peter Zijlstra <a.p.zi...@chello.nl>
AuthorDate: Tue, 9 Nov 2010 19:01:43 +0100
Committer: Ingo Molnar <mi...@elte.hu>
CommitDate: Thu, 18 Nov 2010 13:18:46 +0100

perf: Fix owner-list vs exit

Oleg noticed that a perf-fd keeping a reference on the creating task
leads to a few funny side effects.

There's two different aspects to this:

- kernel based perf-events, these should not take out
a reference on the creating task and appear on the task's
event list since they're not bound to fds nor visible
to userspace.

- fork() and pthread_create(), these can lead to the creating
task dying (and thus the task's event-list becomming useless)
but keeping the list and ref alive until the event is closed.

Combined they lead to malfunction of the ptrace hw_tracepoints.

Cure this by not considering kernel based perf_events for the
owner-list and destroying the owner-list when the owner dies.

Reported-by: Oleg Nesterov <ol...@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zi...@chello.nl>

Acked-by: Oleg Nesterov <ol...@redhat.com>
LKML-Reference: <1289576883.2084.286.camel@laptop>
Signed-off-by: Ingo Molnar <mi...@elte.hu>
---
kernel/perf_event.c | 63 +++++++++++++++++++++++++++++++++++++++++---------
1 files changed, 51 insertions(+), 12 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index f818d9d..671f6c8 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -2235,11 +2235,6 @@ int perf_event_release_kernel(struct perf_event *event)

raw_spin_unlock_irq(&ctx->lock);
mutex_unlock(&ctx->mutex);

- mutex_lock(&event->owner->perf_event_mutex);
- list_del_init(&event->owner_entry);
- mutex_unlock(&event->owner->perf_event_mutex);
- put_task_struct(event->owner);
-
free_event(event);

return 0;

@@ -2252,9 +2247,43 @@ EXPORT_SYMBOL_GPL(perf_event_release_kernel);

@@ -5678,7 +5707,7 @@ SYSCALL_DEFINE5(perf_event_open,

mutex_unlock(&ctx->mutex);

event->owner = current;
- get_task_struct(current);
+
mutex_lock(&current->perf_event_mutex);
list_add_tail(&event->owner_entry, &current->perf_event_list);
mutex_unlock(&current->perf_event_mutex);

@@ -5746,12 +5775,6 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,

++ctx->generation;
mutex_unlock(&ctx->mutex);

- event->owner = current;
- get_task_struct(current);
- mutex_lock(&current->perf_event_mutex);
- list_add_tail(&event->owner_entry, &current->perf_event_list);
- mutex_unlock(&current->perf_event_mutex);
-
return event;

err_free:

@@ -5902,8 +5925,24 @@ again:

Oleg Nesterov

unread,

Jan 17, 2011, 3:50:02 PM1/17/11

to

On 11/08, Oleg Nesterov wrote:
>
> I am trying to understand the usage of hw-breakpoints in arch_ptrace().
> ptrace_set_debugreg() and related code looks obviously racy. Nothing
> protects us against flush_ptrace_hw_breakpoint() called by the dying
> tracee. Afaics we can leak perf_event or use the already freed memory
> or both.
>
> Am I missed something?
>
> Looking into the git history, I don't even know which patch should be
> blamed (if I am right), there were too many changes. I noticed that
> 2ebd4ffb6d0cb877787b1e42be8485820158857e "perf events: Split out task
> search into helper" moved the PF_EXITING check from find_get_context().
> This check coould help if sys_ptrace() races with SIGKILL, but it was
> racy anyway.

Ping.

Any idea how to fix this cleanly? May be we can reuse perf_event_mutex,
but this looks soooo ugly. And do_exit()->flush_ptrace_hw_breakpoint()
has the strange "FIXME:" comment which doesn't help me to understand
what can we do.

Probably the best fix is to change this code so that the tracer owns
->ptrace_bps[], not the tracee. But this is not trivial, and needs a
lot of changes in ptrace code.

I am reading perf_event.c, but all I found so far is a couple of trivial
methods to crash the kernel via sys_perf_event_open(), will report
tomorrow...

Peter Zijlstra

unread,

Jan 17, 2011, 4:00:02 PM1/17/11

to

On Mon, 2011-01-17 at 21:34 +0100, Oleg Nesterov wrote:
> On 11/08, Oleg Nesterov wrote:
> >
> > I am trying to understand the usage of hw-breakpoints in arch_ptrace().
> > ptrace_set_debugreg() and related code looks obviously racy. Nothing
> > protects us against flush_ptrace_hw_breakpoint() called by the dying
> > tracee. Afaics we can leak perf_event or use the already freed memory
> > or both.
> >
> > Am I missed something?
> >
> > Looking into the git history, I don't even know which patch should be
> > blamed (if I am right), there were too many changes. I noticed that
> > 2ebd4ffb6d0cb877787b1e42be8485820158857e "perf events: Split out task
> > search into helper" moved the PF_EXITING check from find_get_context().
> > This check coould help if sys_ptrace() races with SIGKILL, but it was
> > racy anyway.
>
> Ping.
>
> Any idea how to fix this cleanly? May be we can reuse perf_event_mutex,
> but this looks soooo ugly. And do_exit()->flush_ptrace_hw_breakpoint()
> has the strange "FIXME:" comment which doesn't help me to understand
> what can we do.
>
> Probably the best fix is to change this code so that the tracer owns
> ->ptrace_bps[], not the tracee. But this is not trivial, and needs a
> lot of changes in ptrace code.

Wasn't this sorted by: 8882135bcd332f294df5455747ea43ba9e6f77ad?

Or is this purely related to the ptrace muck? in which case I guess
Frederic is you man, I never looked at the hw_breakpoint stuff in
general and the ptrace bits in particular.

> I am reading perf_event.c, but all I found so far is a couple of trivial
> methods to crash the kernel via sys_perf_event_open(), will report
> tomorrow...

Ow, that's not too pretty..

Frederic Weisbecker

unread,

Jan 17, 2011, 4:10:02 PM1/17/11

to

Yeah sorry I lost track on this and left it unanswered in the middle.
Just lemme rewalk the thread and I'm back :)

Frederic Weisbecker

unread,

Jan 17, 2011, 7:00:02 PM1/17/11

to

Aah, so we check if the task is stopped when sys_ptrace() is called,
but right after we do this check, the tracee can be resumed at any time?
(with either SIGCONT or SIGKILL), even if we are servicing the ptrace
request at the same time?

Seems to be so as I look at the code.

Roland McGrath

unread,

Jan 17, 2011, 8:20:01 PM1/17/11

to

> Aah, so we check if the task is stopped when sys_ptrace() is called,
> but right after we do this check, the tracee can be resumed at any time?
> (with either SIGCONT or SIGKILL), even if we are servicing the ptrace
> request at the same time?

Only by SIGKILL.

Oleg Nesterov

unread,

Jan 18, 2011, 11:20:01 AM1/18/11

to

Starting from perf_event_alloc()->perf_init_event(), the kernel
assumes that event->cpu is either -1 or the valid CPU number.

Change perf_event_alloc() to validate this argument early. This
also means we can remove the similar check in find_get_context().

Signed-off-by: Oleg Nesterov <ol...@redhat.com>
---

kernel/perf_event.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)

--- git/kernel/perf_event.c~2_perf_event_alloc 2011-01-18 16:56:40.000000000 +0100
+++ git/kernel/perf_event.c 2011-01-18 16:57:08.000000000 +0100
@@ -2233,9 +2233,6 @@ find_get_context(struct pmu *pmu, struct
if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
return ERR_PTR(-EACCES);

- if (cpu < 0 || cpu >= nr_cpumask_bits)
- return ERR_PTR(-EINVAL);
-
/*
* We could be clever and allow to attach a event to an
* offline CPU and activate it when the CPU comes up, but
@@ -5541,6 +5538,11 @@ perf_event_alloc(struct perf_event_attr
struct hw_perf_event *hwc;
long err;

+ if ((unsigned)cpu >= nr_cpu_ids) {
+ if (!task || cpu != -1)
+ return ERR_PTR(-EINVAL);
+ }
+
event = kzalloc(sizeof(*event), GFP_KERNEL);
if (!event)
return ERR_PTR(-ENOMEM);
@@ -5589,7 +5591,7 @@ perf_event_alloc(struct perf_event_attr

if (!overflow_handler && parent_event)
overflow_handler = parent_event->overflow_handler;
-
+
event->overflow_handler = overflow_handler;

if (attr->disabled)

Oleg Nesterov

unread,

Jan 18, 2011, 11:20:02 AM1/18/11

to

If task == NULL, find_get_context() should always check that cpu
is correct.

Afaics, the bug was introduced by 38a81da2 "perf events: Clean up
pid passing", but even before that commit "&& cpu != -1" was not
exactly right, -ESRCH from find_task_by_vpid() is not accurate.

Signed-off-by: Oleg Nesterov <ol...@redhat.com>
---

kernel/perf_event.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- git/kernel/perf_event.c~1_find_get_context 2011-01-14 18:21:05.000000000 +0100
+++ git/kernel/perf_event.c 2011-01-18 16:56:40.000000000 +0100
@@ -2228,7 +2228,7 @@ find_get_context(struct pmu *pmu, struct
unsigned long flags;
int ctxn, err;

- if (!task && cpu != -1) {
+ if (!task) {
/* Must be root to operate on a CPU event: */

if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
return ERR_PTR(-EACCES);

--

Frederic Weisbecker

unread,

Jan 18, 2011, 1:50:02 PM1/18/11

to

On Mon, Jan 17, 2011 at 09:34:59PM +0100, Oleg Nesterov wrote:
> On 11/08, Oleg Nesterov wrote:
> >
> > I am trying to understand the usage of hw-breakpoints in arch_ptrace().
> > ptrace_set_debugreg() and related code looks obviously racy. Nothing
> > protects us against flush_ptrace_hw_breakpoint() called by the dying
> > tracee. Afaics we can leak perf_event or use the already freed memory
> > or both.
> >
> > Am I missed something?
> >
> > Looking into the git history, I don't even know which patch should be
> > blamed (if I am right), there were too many changes. I noticed that
> > 2ebd4ffb6d0cb877787b1e42be8485820158857e "perf events: Split out task
> > search into helper" moved the PF_EXITING check from find_get_context().
> > This check coould help if sys_ptrace() races with SIGKILL, but it was
> > racy anyway.
>
> Ping.
>
> Any idea how to fix this cleanly? May be we can reuse perf_event_mutex,
> but this looks soooo ugly. And do_exit()->flush_ptrace_hw_breakpoint()
> has the strange "FIXME:" comment which doesn't help me to understand
> what can we do.

Yeah forget about the FIXME, it's a stale thing I need to remove.

>
> Probably the best fix is to change this code so that the tracer owns
> ->ptrace_bps[], not the tracee. But this is not trivial, and needs a
> lot of changes in ptrace code.

How much complicated would it be?

Because I see three solutions to solve this:

- Have a mutex inside thread->ptrace_bps. The contention must be
rare and only concern ptrace and tracee exit. That's the simplest.

- Have an atomic refcount inside thread->ptrace_bps so that the actual
flush can be delayed until necessary. Same as above, but exit and ptrace
can execute concurrently, code must be a bit more complicated though.

- Your solution. I'm just not sure how much change it involves. Seems
like we need to notify the parent for it to flush the breakpoints
when a task exits. Same when ptrace detaches we need to flush.

What do you guys think? At a glance it seems a mutex or a refcount
would take more memory for each thread, but I can manage to have
->ptrace_bps a block only allocated if necessary. It would be only
a pointer if no breakpoint is queued.

tip-bot for Oleg Nesterov

unread,

Jan 18, 2011, 2:10:02 PM1/18/11

to

Commit-ID: 66832eb4baaaa9abe4c993ddf9113a79e39b9915
Gitweb: http://git.kernel.org/tip/66832eb4baaaa9abe4c993ddf9113a79e39b9915
Author: Oleg Nesterov <ol...@redhat.com>
AuthorDate: Tue, 18 Jan 2011 17:10:32 +0100
Committer: Ingo Molnar <mi...@elte.hu>
CommitDate: Tue, 18 Jan 2011 19:34:23 +0100

perf: Validate cpu early in perf_event_alloc()

Starting from perf_event_alloc()->perf_init_event(), the kernel
assumes that event->cpu is either -1 or the valid CPU number.

Change perf_event_alloc() to validate this argument early. This
also means we can remove the similar check in
find_get_context().

Signed-off-by: Oleg Nesterov <ol...@redhat.com>
Acked-by: Peter Zijlstra <a.p.zi...@chello.nl>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Arnaldo Carvalho de Melo <ac...@redhat.com>
Cc: Frederic Weisbecker <fwei...@gmail.com>
Cc: Paul Mackerras <pau...@samba.org>
Cc: Prasad <pra...@linux.vnet.ibm.com>
Cc: Roland McGrath <rol...@redhat.com>
Cc: gre...@suse.de
Cc: sta...@kernel.org
LKML-Reference: <2011011816...@redhat.com>
Signed-off-by: Ingo Molnar <mi...@elte.hu>
---
kernel/perf_event.c | 10 ++++++----
1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index a962b19..67d9bd7 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -2233,9 +2233,6 @@ find_get_context(struct pmu *pmu, struct task_struct *task, int cpu)

if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
return ERR_PTR(-EACCES);

- if (cpu < 0 || cpu >= nr_cpumask_bits)
- return ERR_PTR(-EINVAL);
-
/*
* We could be clever and allow to attach a event to an
* offline CPU and activate it when the CPU comes up, but

@@ -5541,6 +5538,11 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,

struct hw_perf_event *hwc;
long err;

+ if ((unsigned)cpu >= nr_cpu_ids) {
+ if (!task || cpu != -1)
+ return ERR_PTR(-EINVAL);
+ }
+
event = kzalloc(sizeof(*event), GFP_KERNEL);
if (!event)
return ERR_PTR(-ENOMEM);

@@ -5589,7 +5591,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,

tip-bot for Oleg Nesterov

unread,

Jan 18, 2011, 2:10:01 PM1/18/11

to

Commit-ID: 22a4ec729017ba613337a28f306f94ba5023fe2e
Gitweb: http://git.kernel.org/tip/22a4ec729017ba613337a28f306f94ba5023fe2e
Author: Oleg Nesterov <ol...@redhat.com>
AuthorDate: Tue, 18 Jan 2011 17:10:08 +0100

Committer: Ingo Molnar <mi...@elte.hu>
CommitDate: Tue, 18 Jan 2011 19:34:23 +0100

perf: Find_get_context: fix the per-cpu-counter check

If task == NULL, find_get_context() should always check that cpu
is correct.

Afaics, the bug was introduced by 38a81da2 "perf events: Clean
up pid passing", but even before that commit "&& cpu != -1" was
not exactly right, -ESRCH from find_task_by_vpid() is not
accurate.

Signed-off-by: Oleg Nesterov <ol...@redhat.com>

Acked-by: Peter Zijlstra <a.p.zi...@chello.nl>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Arnaldo Carvalho de Melo <ac...@redhat.com>
Cc: Frederic Weisbecker <fwei...@gmail.com>
Cc: Paul Mackerras <pau...@samba.org>
Cc: Prasad <pra...@linux.vnet.ibm.com>
Cc: Roland McGrath <rol...@redhat.com>
Cc: gre...@suse.de
Cc: sta...@kernel.org
LKML-Reference: <2011011816...@redhat.com>
Signed-off-by: Ingo Molnar <mi...@elte.hu>

---
kernel/perf_event.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 76be4c7..a962b19 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -2228,7 +2228,7 @@ find_get_context(struct pmu *pmu, struct task_struct *task, int cpu)

Oleg Nesterov

unread,

Jan 19, 2011, 10:50:02 AM1/19/11

to

On 01/18, Frederic Weisbecker wrote:
>
> On Mon, Jan 17, 2011 at 09:34:59PM +0100, Oleg Nesterov wrote:
> >
> > Any idea how to fix this cleanly? May be we can reuse perf_event_mutex,
> > but this looks soooo ugly. And do_exit()->flush_ptrace_hw_breakpoint()
> > has the strange "FIXME:" comment which doesn't help me to understand
> > what can we do.
>
> Yeah forget about the FIXME, it's a stale thing I need to remove.

OK, good.

> > Probably the best fix is to change this code so that the tracer owns
> > ->ptrace_bps[], not the tracee. But this is not trivial, and needs a
> > lot of changes in ptrace code.
>
> How much complicated would it be?

The problem is, ptrace_detach/release_task can't sleep currently.
The necessary changes are nasty.

> Because I see three solutions to solve this:
>
> - Have a mutex inside thread->ptrace_bps. The contention must be
> rare and only concern ptrace and tracee exit. That's the simplest.

I think we can reuse perf_event_mutex for this. Not very good too,
but simple. But this depends on what can we do under this mutex...

I am going to report a couple more bugs (at least, it looks like
a bug when I am trying to understand the code ;), probably they
should be fixed first.

Oleg.

Frederic Weisbecker

unread,

Jan 19, 2011, 3:10:02 PM1/19/11

to

On Wed, Jan 19, 2011 at 04:37:46PM +0100, Oleg Nesterov wrote:
> On 01/18, Frederic Weisbecker wrote:
> > > Probably the best fix is to change this code so that the tracer owns
> > > ->ptrace_bps[], not the tracee. But this is not trivial, and needs a
> > > lot of changes in ptrace code.
> >
> > How much complicated would it be?
>
> The problem is, ptrace_detach/release_task can't sleep currently.
> The necessary changes are nasty.

Ok, let's forget that then :)

> > Because I see three solutions to solve this:
> >
> > - Have a mutex inside thread->ptrace_bps. The contention must be
> > rare and only concern ptrace and tracee exit. That's the simplest.
>
> I think we can reuse perf_event_mutex for this. Not very good too,
> but simple. But this depends on what can we do under this mutex...

That could work. I feel a bit uncomfortable to use a perf related
mutex for that though. I can't figure out any deadlock with the current
state, but if we are going to use that solution, perf events will be
created/destroyed/disabled/enabled under that mutex. May be that large
coverage might create some dependency problems in the future. I don't
know...

Dunno, that doesn't seem to be a good use of perf_event_mutex.

I had another idea based on a refcount. Can you have a look?
The drawback is to add an entry in task_struct. OTOH I can drop
more of them for the no-running-breakpoint case from thread_struct
in a subsequent task.

Note the problem touches more archs than x86. Basically every
arch that use breakpoint use a similar scheme that must be fixed.

(only compile tested yet)

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 45892dc..08d79f9 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -608,6 +608,9 @@ static int ptrace_write_dr7(struct task_struct *tsk, unsigned long data)
unsigned len, type;
struct perf_event *bp;

+ if (ptrace_get_breakpoints(tsk) < 0)
+ return -ESRCH;
+
data &= ~DR_CONTROL_RESERVED;
old_dr7 = ptrace_get_dr7(thread->ptrace_bps);
restore:
@@ -655,6 +658,9 @@ restore:
}
goto restore;
}
+
+ ptrace_put_breakpoints(tsk);
+
return ((orig_ret < 0) ? orig_ret : rc);
}

@@ -668,10 +674,17 @@ static unsigned long ptrace_get_debugreg(struct task_struct *tsk, int n)

if (n < HBP_NUM) {
struct perf_event *bp;
+
+ if (ptrace_get_breakpoints(tsk) < 0)
+ return -ESRCH;
+
bp = thread->ptrace_bps[n];
if (!bp)
- return 0;
- val = bp->hw.info.address;
+ val = 0;
+ else
+ val = bp->hw.info.address;
+
+ ptrace_put_breakpoints(tsk);
} else if (n == 6) {
val = thread->debugreg6;
} else if (n == 7) {
@@ -686,6 +699,10 @@ static int ptrace_set_breakpoint_addr(struct task_struct *tsk, int nr,
struct perf_event *bp;
struct thread_struct *t = &tsk->thread;
struct perf_event_attr attr;
+ int err = 0;
+
+ if (ptrace_get_breakpoints(tsk) < 0)
+ return -ESRCH;

if (!t->ptrace_bps[nr]) {
ptrace_breakpoint_init(&attr);
@@ -709,24 +726,23 @@ static int ptrace_set_breakpoint_addr(struct task_struct *tsk, int nr,
* writing for the user. And anyway this is the previous
* behaviour.
*/
- if (IS_ERR(bp))
- return PTR_ERR(bp);
+ if (IS_ERR(bp)) {
+ err = PTR_ERR(bp);
+ goto put;
+ }

t->ptrace_bps[nr] = bp;
} else {
- int err;
-
bp = t->ptrace_bps[nr];

attr = bp->attr;
attr.bp_addr = addr;
err = modify_user_hw_breakpoint(bp, &attr);
- if (err)
- return err;
}

-
- return 0;
+put:
+ ptrace_put_breakpoints(tsk);
+ return err;
}

/*
diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index 092a04f..519f03c 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -192,6 +192,9 @@ static inline void ptrace_init_task(struct task_struct *child, bool ptrace)
child->ptrace = current->ptrace;
__ptrace_link(child, current->parent);
}
+#ifdef CONFIG_HAVE_HW_BREAKPOINT
+ atomic_set(&child->ptrace_bp_refcnt, 1);
+#endif
}

/**
@@ -352,7 +355,12 @@ static inline void user_single_step_siginfo(struct task_struct *tsk,
extern int task_current_syscall(struct task_struct *target, long *callno,
unsigned long args[6], unsigned int maxargs,
unsigned long *sp, unsigned long *pc);
-
-#endif
+#ifdef CONFIG_HAVE_HW_BREAKPOINT
+extern int ptrace_get_breakpoints(struct task_struct *tsk);
+extern void ptrace_put_breakpoints(struct task_struct *tsk);
+#else
+static inline void ptrace_put_breakpoints(struct task_struct *tsk) { }
+#endif /* CONFIG_HAVE_HW_BREAKPOINT */
+#endif /* __KERNEL */

#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 777cd01..523a1c5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1526,6 +1526,9 @@ struct task_struct {
unsigned long memsw_bytes; /* uncharged mem+swap usage */
} memcg_batch;
#endif
+#ifdef CONFIG_HAVE_HW_BREAKPOINT
+ atomic_t ptrace_bp_refcnt;
+#endif
};

/* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/kernel/exit.c b/kernel/exit.c
index 8cb8904..5284464 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1015,7 +1015,7 @@ NORET_TYPE void do_exit(long code)
/*
* FIXME: do that only when needed, using sched_exit tracepoint
*/
- flush_ptrace_hw_breakpoint(tsk);
+ ptrace_put_breakpoints(tsk);

exit_notify(tsk, group_dead);
#ifdef CONFIG_NUMA
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 99bbaa3..23394f1 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -22,6 +22,7 @@
#include <linux/syscalls.h>
#include <linux/uaccess.h>
#include <linux/regset.h>
+#include <linux/hw_breakpoint.h>

/*
@@ -876,3 +877,19 @@ asmlinkage long compat_sys_ptrace(compat_long_t request, compat_long_t pid,
return ret;
}
#endif /* CONFIG_COMPAT */
+
+#ifdef CONFIG_HAVE_HW_BREAKPOINT
+int ptrace_get_breakpoints(struct task_struct *tsk)
+{
+ if (atomic_inc_not_zero(&tsk->ptrace_bp_refcnt))
+ return 0;
+
+ return -1;
+}
+
+void ptrace_put_breakpoints(struct task_struct *tsk)
+{
+ if (!atomic_dec_return(&tsk->ptrace_bp_refcnt))
+ flush_ptrace_hw_breakpoint(tsk);
+}
+#endif /* CONFIG_HAVE_HW_BREAKPOINT */

Oleg Nesterov

unread,

Jan 20, 2011, 12:40:02 PM1/20/11

to

On 01/19, Frederic Weisbecker wrote:
>
> On Wed, Jan 19, 2011 at 04:37:46PM +0100, Oleg Nesterov wrote:
> >
> > I think we can reuse perf_event_mutex for this. Not very good too,
> > but simple. But this depends on what can we do under this mutex...
>
> That could work. I feel a bit uncomfortable to use a perf related
> mutex for that though. I can't figure out any deadlock with the current
> state, but if we are going to use that solution, perf events will be
> created/destroyed/disabled/enabled under that mutex.

No, I didn't mean create/destroy under that mutex, but

> Dunno, that doesn't seem to be a good use of perf_event_mutex.

I agree anyway.

> OTOH I can drop
> more of them for the no-running-breakpoint case from thread_struct
> in a subsequent task.

Hmm. Can't understand what do you mean. Just curious, could you explain?

> Note the problem touches more archs than x86. Basically every
> arch that use breakpoint use a similar scheme that must be fixed.

Yes. Perhaps we should try to unify some code... Say, can't we move
->ptrace_bps[] to task_struct?

> +void ptrace_put_breakpoints(struct task_struct *tsk)
> +{
> + if (!atomic_dec_return(&tsk->ptrace_bp_refcnt))
> + flush_ptrace_hw_breakpoint(tsk);

(minor nit, atomic_dec_and_test() looks more natural)

I think the patch is correct and should fix the problem.

Thanks!

Oleg.

Oleg Nesterov

unread,

Jan 18, 2011, 11:20:03 AM1/18/11

to

On 01/17, Peter Zijlstra wrote:
>
> Wasn't this sorted by: 8882135bcd332f294df5455747ea43ba9e6f77ad?

No. This commit only answers my 3rd question, 1-2 are still waiting ;)

> > I am reading perf_event.c, but all I found so far is a couple of trivial
> > methods to crash the kernel via sys_perf_event_open(), will report
> > tomorrow...
>
> Ow, that's not too pretty..

Fortunately, this is trivial. Probably 2.6.37 needs these fixes too.

Oleg.

Frederic Weisbecker

unread,

Jan 28, 2011, 12:50:02 PM1/28/11

to

On Thu, Jan 20, 2011 at 06:28:10PM +0100, Oleg Nesterov wrote:
> On 01/19, Frederic Weisbecker wrote:
> > OTOH I can drop
> > more of them for the no-running-breakpoint case from thread_struct
> > in a subsequent task.
>
> Hmm. Can't understand what do you mean. Just curious, could you explain?

Indeed now that I read that, it was completely not understandable :)

So I meant that currently we have this:

task->thread->ptrace_bps[BP_NUM]

Where ptrace_bps is:

struct perf_event *ptrace_bps[BP_NUM];

And we populate that with pointers when needed. Now this is a waste
of space, I should better make it:

struct perf_event **ptrace_bps;

And only allocate the pointer space when needed.

> > Note the problem touches more archs than x86. Basically every
> > arch that use breakpoint use a similar scheme that must be fixed.
>
> Yes. Perhaps we should try to unify some code... Say, can't we move
> ->ptrace_bps[] to task_struct?

It seems that every archs that currently implement breakpoints have
this linear mapping of registers, even when physically they are not
linear: ARM has a seperate register space for instruction and data
breakpoints for example.

So yeah it seems we can store that in task_struct. I may try that
in a subsequent patch.

>
> > +void ptrace_put_breakpoints(struct task_struct *tsk)
> > +{
> > + if (!atomic_dec_return(&tsk->ptrace_bp_refcnt))
> > + flush_ptrace_hw_breakpoint(tsk);
>
> (minor nit, atomic_dec_and_test() looks more natural)

Indeed, will change that.

Thanks!

Q: perf_event && task->ptrace_bps[]

Oleg Nesterov

Oleg Nesterov

Frederic Weisbecker

Oleg Nesterov

Frederic Weisbecker

Peter Zijlstra

Oleg Nesterov

Oleg Nesterov

Peter Zijlstra

Oleg Nesterov

Peter Zijlstra

Oleg Nesterov

Peter Zijlstra

Oleg Nesterov

Peter Zijlstra

Peter Zijlstra

Oleg Nesterov

Peter Zijlstra

Oleg Nesterov

tip-bot for Peter Zijlstra

Oleg Nesterov

Peter Zijlstra

Frederic Weisbecker

Frederic Weisbecker

Roland McGrath

Oleg Nesterov

Oleg Nesterov

Frederic Weisbecker

tip-bot for Oleg Nesterov

tip-bot for Oleg Nesterov

Oleg Nesterov

Frederic Weisbecker

Oleg Nesterov

Oleg Nesterov

Frederic Weisbecker