In all cases, I could track this down to be caused by a gs segment register,
that had the wrong contents.
This again is caused by a problem in the host linux: A ptraced child going to
stop and having woken up its parent, will save some of its registers (on i386
they are fs, gs and the fp-registers) very late in switch_to. The parent is
granted access to child's registers as soon, as the child is removed from
the runqueue. Thus, in rare cases, the parent might access child's register
savearea before the registers really are saved.
This problem might also be the reason for problems with floatpoint on UML,
that were reported some time ago.
I've written a test program, that reproduces the problem on my 2.6.9 vanilla
host quite quick. Using SuSE kernel 2.6.5-7.97-smp, I can't reproduce the
problem, although the relevant parts seem to be unchanged. Maybe not related
changes modify the timing?
I also created a patch, that fixes the problem on my 2.6.9 host. This probably
isn't a sane patch, but is enough to demonstrate, where I think, the bug is.
Both files are attached.
Bodo
I don't see how this could help because AFAIKS, child->saving is only
set and cleared while the runqueue is locked. And the same runqueue lock
is taken by wait_task_inactive.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Sorry, that not right. There are some routines called by sched(), that release
and reacquire the runqueue lock.
Bodo
Sure. I wouldn't consider Bodo's patch to be the one to use though..
No. Something similar could be done that works on all architectures
and all wait_task_inactive callers (and is confined to sched.c). That
would still be more or less a hack to work around smtnice's unfortunate
locking though.
Oh yeah, it is the wake_sleeping_dependent / dependent_sleeper crap.
Sorry, you are right. And that's definitely a bug in sched.c, because
it breaks wait_task_inactive, as you've rightly observed.
Andrew, IMO this is another bug to hold 2.6.11 for.
>>> Andrew, IMO this is another bug to hold 2.6.11 for.
>>
>>
>>
>> Sure. I wouldn't consider Bodo's patch to be the one to use though..
>
>
> No. Something similar could be done that works on all architectures
> and all wait_task_inactive callers (and is confined to sched.c). That
> would still be more or less a hack to work around smtnice's unfortunate
> locking though.
>
Something like the following (untested) extension of Bodo's work
could be the minimal fix for 2.6.11. As I've said though, I'd
consider it a hack and prefer to do something about the locking.
That could be done after 2.6.11 though. Depends how you feel.
Bodo, I wonder if this looks like a suitable fix for your problem?
> Something like the following (untested) extension of Bodo's work
> could be the minimal fix for 2.6.11. As I've said though, I'd
> consider it a hack and prefer to do something about the locking.
> That could be done after 2.6.11 though. Depends how you feel.
>
I think this is the right fix.
When a task is put to sleep, it is dequeued from the runqueue
while it is still running. The problem is that the runqueue
lock can be dropped and retaken in schedule() before the task
actually schedules off, and wait_task_inactive did not account
for this.
I introduced a new function to resolve this state, fixed
wait_task_inactive, and converted over an open coded test.
I did a quick audit of sched.c, and nothing else seems to have
made the same mistake.
Signed-off-by: Nick Piggin <nickp...@yahoo.com.au>
Question: why does wait_task_inactive have different semantics
for UP && PREEMPT than SMP && PREEMPT? I can see that the kthread
caler probably isn't used in the UP case, but technically it is
relying on behaviour that it doesn't get with UP and PREEMPT.
Looks like the ptrace.c caller won't care.
But still, can we either fix it or put a nice comment there?
Preferably fix, if this isn't a very performance critical path?
> When a task is put to sleep, it is dequeued from the runqueue
> while it is still running. The problem is that the runqueue
> lock can be dropped and retaken in schedule() before the task
> actually schedules off, and wait_task_inactive did not account
> for this.
ugh. This has been the Nth time we got bitten by the fundamental
unrobustness of non-atomic scheduling on some architectures ...
(And i'll say the N+1th time that this is not good.)
> +static int task_onqueue(runqueue_t *rq, task_t *p)
> +{
> + return (p->array || task_running(rq, p));
> +}
the fix looks good, but i'd go for the simplified oneliner patch below.
I dont like the name 'task_onqueue()', a because a task is 'on the
queue' when it has p->array != NULL. The task is running when it's
task_running(). On architectures with nonatomic scheduling a task may
be running while not on the queue and external observers with the
runqueue lock might notice this - and wait_task_inactive() has to take
care of this case. I'm not sure how introducing a function named
"task_onqueue()" will make the situation any clearer.
ok?
Ingo
--
When a task is put to sleep, it is dequeued from the runqueue
while it is still running. The problem is that one some arches
that has non-atomic scheduling, the runqueue lock can be
dropped and retaken in schedule() before the task actually
schedules off, and wait_task_inactive did not account for this.
Signed-off-by: Nick Piggin <nickp...@yahoo.com.au>
Signed-off-by: Ingo Molnar <mi...@elte.hu>
---
linux/kernel/sched.c | 2 +-
1 files changed, 1 insertion(+), 1 deletion(-)
--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -867,7 +875,7 @@ void wait_task_inactive(task_t * p)
repeat:
rq = task_rq_lock(p, &flags);
/* Must be off runqueue entirely, not preempted. */
- if (unlikely(p->array)) {
+ if (unlikely(p->array || task_running(rq, p))) {
/* If it's preempted, we yield. It could be a while. */
preempted = !task_running(rq, p);
task_rq_unlock(rq, &flags);
This is actually due to wake_sleeping_dependent and
dependent_sleeper dropping the runqueue lock.
Actually idle_balance can (in rare cases) drop the lock as well,
which I didn't notice earlier, so it is something that we
have been doing forever. The smtnice locking has just exposed
the problem for us, so I wrongfully bashed it earlier *blush*.
>
>>+static int task_onqueue(runqueue_t *rq, task_t *p)
>>+{
>>+ return (p->array || task_running(rq, p));
>>+}
>
>
> the fix looks good, but i'd go for the simplified oneliner patch below.
> I dont like the name 'task_onqueue()', a because a task is 'on the
> queue' when it has p->array != NULL. The task is running when it's
> task_running(). On architectures with nonatomic scheduling a task may
> be running while not on the queue and external observers with the
> runqueue lock might notice this - and wait_task_inactive() has to take
> care of this case. I'm not sure how introducing a function named
> "task_onqueue()" will make the situation any clearer.
>
> ok?
>
Well just because there is a specific condition that both callsites
require. That is, the task is neither running nor on the runqueue.
While task_onqueue is technically wrong if you're looking down into
the fine details of the priority queue management, I think it is OK
to go up a level of abstraction and say that the task is
"on the runqueue" if it is either running or waiting to run.
It is really the one condition that is made un-intuitive due to the
locking in schedule(), so I thought formalising it would be better.
Suggestions for a better name welcome? ;)
Your one liner would fix the problem too, of course. The important
thing at this stage is that it gets fixed for 2.6.11.
Thanks,
Nick
Hmph, *and* unlocked context switch architectures as you say.
In fact, I'm surprised those haven't been bitten by this problem
earlier.
So that makes us each half right! :)
Sorry, have been off the net last week.
Thank you for the patches. Have tested Ingo's one liner.
It works fine for me, as expected.
Bodo