Race condition in ptrace

Bodo Stroesser

unread,

Feb 3, 2005, 7:44:39 AM2/3/05

to Roland Mc Grath, Jeff Dike, BlaisorBlade, user-mode-linux devel, linux-...@vger.kernel.org

Working with the new UML skas0 mode on my Xeon HT host, sporadically I saw
some processes on UML segfaulting.

In all cases, I could track this down to be caused by a gs segment register,
that had the wrong contents.

This again is caused by a problem in the host linux: A ptraced child going to
stop and having woken up its parent, will save some of its registers (on i386
they are fs, gs and the fp-registers) very late in switch_to. The parent is
granted access to child's registers as soon, as the child is removed from
the runqueue. Thus, in rare cases, the parent might access child's register
savearea before the registers really are saved.

This problem might also be the reason for problems with floatpoint on UML,
that were reported some time ago.

I've written a test program, that reproduces the problem on my 2.6.9 vanilla
host quite quick. Using SuSE kernel 2.6.5-7.97-smp, I can't reproduce the
problem, although the relevant parts seem to be unchanged. Maybe not related
changes modify the timing?

I also created a patch, that fixes the problem on my 2.6.9 host. This probably
isn't a sane patch, but is enough to demonstrate, where I think, the bug is.
Both files are attached.

Bodo

fix-ptrace-race.patch

test_ptrace.c

Nick Piggin

unread,

Feb 3, 2005, 7:31:40 PM2/3/05

to Bodo Stroesser, Roland Mc Grath, Jeff Dike, BlaisorBlade, user-mode-linux devel, linux-...@vger.kernel.org

> ------------------------------------------------------------------------
>
> --- a/include/linux/sched.h 2005-02-02 22:15:51.000000000 +0100
> +++ b/include/linux/sched.h 2005-02-02 22:22:54.000000000 +0100
> @@ -584,6 +584,7 @@ struct task_struct {
> struct mempolicy *mempolicy;
> short il_next; /* could be shared with used_math */
> #endif
> + volatile long saving;
> };
>
> static inline pid_t process_group(struct task_struct *tsk)
> --- a/kernel/sched.c 2005-02-02 21:32:51.000000000 +0100
> +++ b/kernel/sched.c 2005-02-02 22:12:14.000000000 +0100
> @@ -2689,8 +2689,10 @@ need_resched:
> if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&
> unlikely(signal_pending(prev))))
> prev->state = TASK_RUNNING;
> - else
> + else {
> + prev->saving = 1;
> deactivate_task(prev, rq);
> + }
> }
>
> cpu = smp_processor_id();
> --- a/kernel/ptrace.c 2005-02-02 22:12:33.000000000 +0100
> +++ b/kernel/ptrace.c 2005-02-02 22:20:46.000000000 +0100
> @@ -96,6 +96,7 @@ int ptrace_check_attach(struct task_stru
>
> if (!ret && !kill) {
> wait_task_inactive(child);
> + while ( child->saving ) ;
> }
>
> /* All systems go.. */
> --- a/arch/i386/kernel/process.c 2005-02-02 22:18:29.000000000 +0100
> +++ b/arch/i386/kernel/process.c 2005-02-02 22:19:22.000000000 +0100
> @@ -577,6 +577,9 @@ struct task_struct fastcall * __switch_t
> asm volatile("movl %%fs,%0":"=m" (*(int *)&prev->fs));
> asm volatile("movl %%gs,%0":"=m" (*(int *)&prev->gs));
>
> + wmb();
> + prev_p->saving=0;
> +
> /*
> * Restore %fs and %gs if needed.
> */
>

I don't see how this could help because AFAIKS, child->saving is only
set and cleared while the runqueue is locked. And the same runqueue lock
is taken by wait_task_inactive.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Bodo Stroesser

unread,

Feb 4, 2005, 7:24:24 AM2/4/05

to Nick Piggin, Roland Mc Grath, Jeff Dike, BlaisorBlade, user-mode-linux devel, linux-...@vger.kernel.org

Sorry, that not right. There are some routines called by sched(), that release
and reacquire the runqueue lock.

Bodo

Andrew Morton

unread,

Feb 4, 2005, 6:05:32 PM2/4/05

to Nick Piggin, bstro...@fujitsu-siemens.com, rol...@redhat.com, jd...@addtoit.com, blaisorb...@yahoo.it, user-mode-...@lists.sourceforge.net, linux-...@vger.kernel.org

Nick Piggin <nickp...@yahoo.com.au> wrote:

> Oh yeah, it is the wake_sleeping_dependent / dependent_sleeper crap.
> Sorry, you are right. And that's definitely a bug in sched.c, because
> it breaks wait_task_inactive, as you've rightly observed.
>
> Andrew, IMO this is another bug to hold 2.6.11 for.

Sure. I wouldn't consider Bodo's patch to be the one to use though..

Nick Piggin

unread,

Feb 4, 2005, 6:19:58 PM2/4/05

to Andrew Morton, bstro...@fujitsu-siemens.com, rol...@redhat.com, jd...@addtoit.com, blaisorb...@yahoo.it, user-mode-...@lists.sourceforge.net, linux-...@vger.kernel.org

Andrew Morton wrote:
> Nick Piggin <nickp...@yahoo.com.au> wrote:
>
>>Bodo Stroesser wrote:
>>
>>>Nick Piggin wrote:
>>>
>>>
>>>>Bodo Stroesser wrote:
>>
>>>>I don't see how this could help because AFAIKS, child->saving is only
>>>>set and cleared while the runqueue is locked. And the same runqueue lock
>>>>is taken by wait_task_inactive.
>>>>
>>>
>>>Sorry, that not right. There are some routines called by sched(), that
>>>release
>>>and reacquire the runqueue lock.
>>>
>>
>>Oh yeah, it is the wake_sleeping_dependent / dependent_sleeper crap.
>>Sorry, you are right. And that's definitely a bug in sched.c, because
>>it breaks wait_task_inactive, as you've rightly observed.
>>
>>Andrew, IMO this is another bug to hold 2.6.11 for.
>
>
> Sure. I wouldn't consider Bodo's patch to be the one to use though..

No. Something similar could be done that works on all architectures
and all wait_task_inactive callers (and is confined to sched.c). That
would still be more or less a hack to work around smtnice's unfortunate
locking though.

Nick Piggin

unread,

Feb 4, 2005, 7:41:16 PM2/4/05

to Bodo Stroesser, Roland Mc Grath, Jeff Dike, BlaisorBlade, user-mode-linux devel, linux-...@vger.kernel.org, Andrew Morton

Oh yeah, it is the wake_sleeping_dependent / dependent_sleeper crap.

Sorry, you are right. And that's definitely a bug in sched.c, because
it breaks wait_task_inactive, as you've rightly observed.

Andrew, IMO this is another bug to hold 2.6.11 for.

Nick Piggin

unread,

Feb 4, 2005, 11:39:30 PM2/4/05

to Andrew Morton, bstro...@fujitsu-siemens.com, rol...@redhat.com, jd...@addtoit.com, blaisorb...@yahoo.it, user-mode-...@lists.sourceforge.net, linux-...@vger.kernel.org

Nick Piggin wrote:
> Andrew Morton wrote:
>
>> Nick Piggin <nickp...@yahoo.com.au> wrote:
>>

>>> Andrew, IMO this is another bug to hold 2.6.11 for.
>>
>>
>>
>> Sure. I wouldn't consider Bodo's patch to be the one to use though..
>
>
> No. Something similar could be done that works on all architectures
> and all wait_task_inactive callers (and is confined to sched.c). That
> would still be more or less a hack to work around smtnice's unfortunate
> locking though.
>

Something like the following (untested) extension of Bodo's work
could be the minimal fix for 2.6.11. As I've said though, I'd
consider it a hack and prefer to do something about the locking.
That could be done after 2.6.11 though. Depends how you feel.

Bodo, I wonder if this looks like a suitable fix for your problem?

sched-fixup-locking.patch

Nick Piggin

unread,

Feb 5, 2005, 10:29:29 PM2/5/05

to Andrew Morton, bstro...@fujitsu-siemens.com, rol...@redhat.com, jd...@addtoit.com, blaisorb...@yahoo.it, user-mode-...@lists.sourceforge.net, linux-...@vger.kernel.org, Ingo Molnar

Nick Piggin wrote:

> Something like the following (untested) extension of Bodo's work
> could be the minimal fix for 2.6.11. As I've said though, I'd
> consider it a hack and prefer to do something about the locking.
> That could be done after 2.6.11 though. Depends how you feel.
>

I think this is the right fix.

When a task is put to sleep, it is dequeued from the runqueue
while it is still running. The problem is that the runqueue
lock can be dropped and retaken in schedule() before the task
actually schedules off, and wait_task_inactive did not account
for this.

I introduced a new function to resolve this state, fixed
wait_task_inactive, and converted over an open coded test.

I did a quick audit of sched.c, and nothing else seems to have
made the same mistake.

Signed-off-by: Nick Piggin <nickp...@yahoo.com.au>

Question: why does wait_task_inactive have different semantics
for UP && PREEMPT than SMP && PREEMPT? I can see that the kthread
caler probably isn't used in the UP case, but technically it is
relying on behaviour that it doesn't get with UP and PREEMPT.
Looks like the ptrace.c caller won't care.

But still, can we either fix it or put a nice comment there?
Preferably fix, if this isn't a very performance critical path?

sched-fixup-races.patch

Ingo Molnar

unread,

Feb 6, 2005, 2:23:56 AM2/6/05

to Nick Piggin, Andrew Morton, bstro...@fujitsu-siemens.com, rol...@redhat.com, jd...@addtoit.com, blaisorb...@yahoo.it, user-mode-...@lists.sourceforge.net, linux-...@vger.kernel.org

* Nick Piggin <nickp...@yahoo.com.au> wrote:

> When a task is put to sleep, it is dequeued from the runqueue
> while it is still running. The problem is that the runqueue
> lock can be dropped and retaken in schedule() before the task
> actually schedules off, and wait_task_inactive did not account
> for this.

ugh. This has been the Nth time we got bitten by the fundamental
unrobustness of non-atomic scheduling on some architectures ...
(And i'll say the N+1th time that this is not good.)

> +static int task_onqueue(runqueue_t *rq, task_t *p)
> +{
> + return (p->array || task_running(rq, p));
> +}

the fix looks good, but i'd go for the simplified oneliner patch below.
I dont like the name 'task_onqueue()', a because a task is 'on the
queue' when it has p->array != NULL. The task is running when it's
task_running(). On architectures with nonatomic scheduling a task may
be running while not on the queue and external observers with the
runqueue lock might notice this - and wait_task_inactive() has to take
care of this case. I'm not sure how introducing a function named
"task_onqueue()" will make the situation any clearer.

ok?

Ingo

--

When a task is put to sleep, it is dequeued from the runqueue

while it is still running. The problem is that one some arches
that has non-atomic scheduling, the runqueue lock can be

dropped and retaken in schedule() before the task actually
schedules off, and wait_task_inactive did not account for this.

Signed-off-by: Nick Piggin <nickp...@yahoo.com.au>
Signed-off-by: Ingo Molnar <mi...@elte.hu>

---

linux/kernel/sched.c | 2 +-
1 files changed, 1 insertion(+), 1 deletion(-)

--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -867,7 +875,7 @@ void wait_task_inactive(task_t * p)
repeat:
rq = task_rq_lock(p, &flags);
/* Must be off runqueue entirely, not preempted. */
- if (unlikely(p->array)) {
+ if (unlikely(p->array || task_running(rq, p))) {
/* If it's preempted, we yield. It could be a while. */
preempted = !task_running(rq, p);
task_rq_unlock(rq, &flags);

Nick Piggin

unread,

Feb 6, 2005, 2:40:01 AM2/6/05

to Ingo Molnar, Andrew Morton, bstro...@fujitsu-siemens.com, rol...@redhat.com, jd...@addtoit.com, blaisorb...@yahoo.it, user-mode-...@lists.sourceforge.net, linux-...@vger.kernel.org

Ingo Molnar wrote:
> * Nick Piggin <nickp...@yahoo.com.au> wrote:
>
>
>>When a task is put to sleep, it is dequeued from the runqueue
>>while it is still running. The problem is that the runqueue
>>lock can be dropped and retaken in schedule() before the task
>>actually schedules off, and wait_task_inactive did not account
>>for this.
>
>
> ugh. This has been the Nth time we got bitten by the fundamental
> unrobustness of non-atomic scheduling on some architectures ...
> (And i'll say the N+1th time that this is not good.)
>

This is actually due to wake_sleeping_dependent and
dependent_sleeper dropping the runqueue lock.

Actually idle_balance can (in rare cases) drop the lock as well,
which I didn't notice earlier, so it is something that we
have been doing forever. The smtnice locking has just exposed
the problem for us, so I wrongfully bashed it earlier *blush*.

>
>>+static int task_onqueue(runqueue_t *rq, task_t *p)
>>+{
>>+ return (p->array || task_running(rq, p));
>>+}
>
>
> the fix looks good, but i'd go for the simplified oneliner patch below.
> I dont like the name 'task_onqueue()', a because a task is 'on the
> queue' when it has p->array != NULL. The task is running when it's
> task_running(). On architectures with nonatomic scheduling a task may
> be running while not on the queue and external observers with the
> runqueue lock might notice this - and wait_task_inactive() has to take
> care of this case. I'm not sure how introducing a function named
> "task_onqueue()" will make the situation any clearer.
>
> ok?
>

Well just because there is a specific condition that both callsites
require. That is, the task is neither running nor on the runqueue.

While task_onqueue is technically wrong if you're looking down into
the fine details of the priority queue management, I think it is OK
to go up a level of abstraction and say that the task is
"on the runqueue" if it is either running or waiting to run.

It is really the one condition that is made un-intuitive due to the
locking in schedule(), so I thought formalising it would be better.
Suggestions for a better name welcome? ;)

Your one liner would fix the problem too, of course. The important
thing at this stage is that it gets fixed for 2.6.11.

Thanks,
Nick

Nick Piggin

unread,

Feb 6, 2005, 2:51:46 AM2/6/05

to Ingo Molnar, Andrew Morton, bstro...@fujitsu-siemens.com, rol...@redhat.com, jd...@addtoit.com, blaisorb...@yahoo.it, user-mode-...@lists.sourceforge.net, linux-...@vger.kernel.org

Nick Piggin wrote:
> Ingo Molnar wrote:
>
>> * Nick Piggin <nickp...@yahoo.com.au> wrote:
>>
>>
>>> When a task is put to sleep, it is dequeued from the runqueue
>>> while it is still running. The problem is that the runqueue
>>> lock can be dropped and retaken in schedule() before the task
>>> actually schedules off, and wait_task_inactive did not account
>>> for this.
>>
>>
>>
>> ugh. This has been the Nth time we got bitten by the fundamental
>> unrobustness of non-atomic scheduling on some architectures ...
>> (And i'll say the N+1th time that this is not good.)
>>
>
> This is actually due to wake_sleeping_dependent and
> dependent_sleeper dropping the runqueue lock.
>

Hmph, *and* unlocked context switch architectures as you say.
In fact, I'm surprised those haven't been bitten by this problem
earlier.

So that makes us each half right! :)

Bodo Stroesser

unread,

Feb 14, 2005, 11:10:14 AM2/14/05

to Nick Piggin, Ingo Molnar, Andrew Morton, rol...@redhat.com, jd...@addtoit.com, blaisorb...@yahoo.it, user-mode-...@lists.sourceforge.net, linux-...@vger.kernel.org

Nick Piggin wrote:
> Your one liner would fix the problem too, of course. The important
> thing at this stage is that it gets fixed for 2.6.11.

Sorry, have been off the net last week.

Thank you for the patches. Have tested Ingo's one liner.
It works fine for me, as expected.

Bodo