
[PATCH 2/6] move_task_off_dead_cpu: take rq->lock around select_fallback_rq()


Oleg Nesterov
Mar 15, 2010, 5:20:02 AM
move_task_off_dead_cpu()->select_fallback_rq() reads/updates ->cpus_allowed
lockless. We can race with set_cpus_allowed() running in parallel.
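(For reference, the pre-patch code path, reconstructed from the removed
lines in the diff below; the unlocked select_fallback_rq() call is the
racy read:)

static void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p)
{
	int dest_cpu;
again:
	/*
	 * Reads/updates ->cpus_allowed with no lock held; a concurrent
	 * set_cpus_allowed() can change the mask under us.
	 */
	dest_cpu = select_fallback_rq(dead_cpu, p);

	/* It can have affinity changed while we were choosing. */
	if (unlikely(!__migrate_task_irq(p, dead_cpu, dest_cpu)))
		goto again;
}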

Change it to take rq->lock around select_fallback_rq(). Note that it is not
trivial to move this spin_lock() into select_fallback_rq(): we must recheck
that the task was not migrated after we take the lock, and the other callers
do not need this lock.

To avoid the races with other callers of select_fallback_rq() which rely on
TASK_WAKING, we also check p->state != TASK_WAKING and do nothing otherwise.
The owner of TASK_WAKING must update ->cpus_allowed and choose the correct
CPU anyway, and the subsequent __migrate_task() is just meaningless because
p->se.on_rq must be false.

Alternatively, we could change select_task_rq() to take rq->lock right
after it calls sched_class->select_task_rq(), but this looks a bit ugly.

Also, change it to not assume irqs are disabled and absorb __migrate_task_irq().

Signed-off-by: Oleg Nesterov <ol...@redhat.com>
---

kernel/sched.c | 30 +++++++++++++++---------------
1 file changed, 15 insertions(+), 15 deletions(-)

--- 34-rc1/kernel/sched.c~2_MTODC_TAKE_RQ_LOCK	2010-03-15 09:40:16.000000000 +0100
+++ 34-rc1/kernel/sched.c	2010-03-15 09:40:31.000000000 +0100
@@ -5509,29 +5509,29 @@ static int migration_thread(void *data)
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
-
-static int __migrate_task_irq(struct task_struct *p, int src_cpu, int dest_cpu)
-{
-	int ret;
-
-	local_irq_disable();
-	ret = __migrate_task(p, src_cpu, dest_cpu);
-	local_irq_enable();
-	return ret;
-}
-
 /*
  * Figure out where task on dead CPU should go, use force if necessary.
  */
 static void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p)
 {
-	int dest_cpu;
-
+	struct rq *rq = cpu_rq(dead_cpu);
+	int needs_cpu, dest_cpu;
+	unsigned long flags;
 again:
-	dest_cpu = select_fallback_rq(dead_cpu, p);
+	local_irq_save(flags);
+
+	raw_spin_lock(&rq->lock);
+	needs_cpu = (task_cpu(p) == dead_cpu) && (p->state != TASK_WAKING);
+	if (needs_cpu)
+		dest_cpu = select_fallback_rq(dead_cpu, p);
+	raw_spin_unlock(&rq->lock);
 
 	/* It can have affinity changed while we were choosing. */
-	if (unlikely(!__migrate_task_irq(p, dead_cpu, dest_cpu)))
+	if (needs_cpu)
+		needs_cpu = !__migrate_task(p, dead_cpu, dest_cpu);
+	local_irq_restore(flags);
+
+	if (unlikely(needs_cpu))
 		goto again;
 }


Peter Zijlstra
Mar 24, 2010, 11:50:01 AM
On Mon, 2010-03-15 at 10:10 +0100, Oleg Nesterov wrote:
> static void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p)
> {
> +	struct rq *rq = cpu_rq(dead_cpu);
> +	int needs_cpu, dest_cpu;
> +	unsigned long flags;
> again:
> +	local_irq_save(flags);
> +
> +	raw_spin_lock(&rq->lock);
> +	needs_cpu = (task_cpu(p) == dead_cpu) && (p->state != TASK_WAKING);

^
kernel/sched.c:5445: warning: ‘dest_cpu’ may be used uninitialized in this function

> +	if (needs_cpu)
> +		dest_cpu = select_fallback_rq(dead_cpu, p);

Oleg Nesterov
Mar 24, 2010, 12:10:03 PM
On 03/24, Peter Zijlstra wrote:
>
> On Mon, 2010-03-15 at 10:10 +0100, Oleg Nesterov wrote:
> > static void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p)
> > {
> > +	struct rq *rq = cpu_rq(dead_cpu);
> > +	int needs_cpu, dest_cpu;
> > +	unsigned long flags;
> > again:
> > +	local_irq_save(flags);
> > +
> > +	raw_spin_lock(&rq->lock);
> > +	needs_cpu = (task_cpu(p) == dead_cpu) && (p->state != TASK_WAKING);
>
> ^
> kernel/sched.c:5445: warning: ‘dest_cpu’ may be used uninitialized in this function

Hmm. looks like my gcc is more friendly...

OK. Certainly I'll send the updated patch if this series otherwise passes
your review.

Oleg.

Peter Zijlstra
Mar 24, 2010, 12:20:02 PM
On Wed, 2010-03-24 at 17:07 +0100, Oleg Nesterov wrote:
> On 03/24, Peter Zijlstra wrote:
> >
> > On Mon, 2010-03-15 at 10:10 +0100, Oleg Nesterov wrote:
> > > static void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p)
> > > {
> > > +	struct rq *rq = cpu_rq(dead_cpu);
> > > +	int needs_cpu, dest_cpu;
> > > +	unsigned long flags;
> > > again:
> > > +	local_irq_save(flags);
> > > +
> > > +	raw_spin_lock(&rq->lock);
> > > +	needs_cpu = (task_cpu(p) == dead_cpu) && (p->state != TASK_WAKING);
> >
> > ^
> > kernel/sched.c:5445: warning: ‘dest_cpu’ may be used uninitialized in this function
>
> Hmm. looks like my gcc is more friendly...

Hrm, that and I'm apparently unable to read: it said dest_cpu, not
dead_cpu... ah well, I'll slam an __maybe_unused in.

> OK. Certainly I'll send the updated patch if this series otherwise passes
> your review.

Yeah, you made a few good points in 0/6, am now staring at the code on
how to close those holes, hope to post something sensible soon.

Oleg Nesterov
Mar 24, 2010, 12:40:03 PM
On 03/24, Peter Zijlstra wrote:
>
> Yeah, you made a few good points in 0/6, am now staring at the code on
> how to close those holes, hope to post something sensible soon.

Yes, great.

Speaking of 0/6, I forgot to ask a couple more questions...

try_to_wake_up() does task_rq_lock() which checks TASK_WAKING. Perhaps
it shouldn't? I mean, perhaps try_to_wake_up() can take rq->lock without
checking task->state. It can never race with the owner of TASK_WAKING;
before anything else we check "p->state & state".

And. Without the change above, any owner of TASK_WAKING must disable
preemption and clear irqs.

What do you think?
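
(To make the "p->state & state" point concrete, this is roughly how
try_to_wake_up() begins -- a paraphrased sketch, not verbatim 34-rc1
code:)

	rq = task_rq_lock(p, &flags);
	update_rq_clock(rq);
	if (!(p->state & state))	/* no caller passes TASK_WAKING here */
		goto out;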


And a stupid question. While doing these changes I was really, really
puzzled by task_rq_lock() which does

	local_irq_save(*flags);
	rq = task_rq(p);
	raw_spin_lock(&rq->lock);

to the point, I even tried to read the comment which says:

	Note the ordering: we can safely lookup the task_rq without
	explicitly disabling preemption.

Could you please explain what this means? IOW, why can't we do

	rq = task_rq(p);
	raw_spin_lock_irqsave(&rq->lock, flags);

instead?

Of course, this doesn't really matter, but I'd like to understand
what I have missed here.

Thanks,

Oleg.
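
(Editorial aside: the task_rq_lock() under discussion is a retry loop
around the three lines Oleg quotes -- a sketch reconstructed from memory
of 34-rc1 kernel/sched.c, so details are approximate:)

static struct rq *task_rq_lock(struct task_struct *p, unsigned long *flags)
	__acquires(rq->lock)
{
	struct rq *rq;

	for (;;) {
		while (task_is_waking(p))
			cpu_relax();
		local_irq_save(*flags);
		rq = task_rq(p);
		raw_spin_lock(&rq->lock);
		/* retry if the task migrated or went TASK_WAKING meanwhile */
		if (likely(rq == task_rq(p) && !task_is_waking(p)))
			return rq;
		raw_spin_unlock_irqrestore(&rq->lock, *flags);
	}
}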

Peter Zijlstra
Mar 26, 2010, 5:10:02 AM
On Wed, 2010-03-24 at 17:33 +0100, Oleg Nesterov wrote:
> On 03/24, Peter Zijlstra wrote:
> >
> > Yeah, you made a few good points in 0/6, am now staring at the code on
> > how to close those holes, hope to post something sensible soon.
>
> Yes, great.
>
> Speaking of 0/6, I forgot to ask a couple more question...
>
> try_to_wake_up() does task_rq_lock() which checks TASK_WAKING. Perhaps
> it shouldn't? I mean, perhaps try_to_wake_up() can take rq->lock without
> checking task->state. It can never race with the owner of TASK_WAKING;
> before anything else we check "p->state & state".

You're right, but creating a special task_rq_lock() for ttwu() went a
little far; now that we can remove all that again, this too should be
good again.


> And a stupid question. While doing these changes I was really, really
> puzzled by task_rq_lock() which does
>
>	local_irq_save(*flags);
>	rq = task_rq(p);
>	raw_spin_lock(&rq->lock);
>
> to the point, I even tried to read the comment which says:
>
>	Note the ordering: we can safely lookup the task_rq without
>	explicitly disabling preemption.
>
> Could you please explain what this means? IOW, why can't we do
>
>	rq = task_rq(p);
>	raw_spin_lock_irqsave(&rq->lock, flags);
>
> instead?

I'm not sure why that is the case; v2.6.14:kernel/sched.c already has
that. Ingo, can you remember any reason for this, or should we change the
code like Oleg suggests?

tip-bot for Oleg Nesterov
Apr 2, 2010, 3:20:02 PM
Commit-ID: 1445c08d06c5594895b4fae952ef8a457e89c390
Gitweb: http://git.kernel.org/tip/1445c08d06c5594895b4fae952ef8a457e89c390
Author: Oleg Nesterov <ol...@redhat.com>
AuthorDate: Mon, 15 Mar 2010 10:10:10 +0100
Committer: Ingo Molnar <mi...@elte.hu>
CommitDate: Fri, 2 Apr 2010 20:12:01 +0200

sched: move_task_off_dead_cpu(): Take rq->lock around select_fallback_rq()

move_task_off_dead_cpu()->select_fallback_rq() reads/updates ->cpus_allowed
lockless. We can race with set_cpus_allowed() running in parallel.

Change it to take rq->lock around select_fallback_rq(). Note that it is not
trivial to move this spin_lock() into select_fallback_rq(): we must recheck
that the task was not migrated after we take the lock, and the other callers
do not need this lock.

To avoid the races with other callers of select_fallback_rq() which rely on
TASK_WAKING, we also check p->state != TASK_WAKING and do nothing otherwise.
The owner of TASK_WAKING must update ->cpus_allowed and choose the correct
CPU anyway, and the subsequent __migrate_task() is just meaningless because
p->se.on_rq must be false.

Alternatively, we could change select_task_rq() to take rq->lock right
after it calls sched_class->select_task_rq(), but this looks a bit ugly.

Also, change it to not assume irqs are disabled and absorb __migrate_task_irq().

Signed-off-by: Oleg Nesterov <ol...@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zi...@chello.nl>
LKML-Reference: <2010031509...@redhat.com>
Signed-off-by: Ingo Molnar <mi...@elte.hu>
---
kernel/sched.c | 30 +++++++++++++++---------------
1 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index c0b3ebc..27774b5 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5448,29 +5448,29 @@ static int migration_thread(void *data)
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
-
-static int __migrate_task_irq(struct task_struct *p, int src_cpu, int dest_cpu)
-{
-	int ret;
-
-	local_irq_disable();
-	ret = __migrate_task(p, src_cpu, dest_cpu);
-	local_irq_enable();
-	return ret;
-}
-
 /*
  * Figure out where task on dead CPU should go, use force if necessary.
  */
 static void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p)
 {
-	int dest_cpu;
-
+	struct rq *rq = cpu_rq(dead_cpu);
+	int needs_cpu, uninitialized_var(dest_cpu);
+	unsigned long flags;
 again:
-	dest_cpu = select_fallback_rq(dead_cpu, p);
+	local_irq_save(flags);
+
+	raw_spin_lock(&rq->lock);
+	needs_cpu = (task_cpu(p) == dead_cpu) && (p->state != TASK_WAKING);
+	if (needs_cpu)
+		dest_cpu = select_fallback_rq(dead_cpu, p);
+	raw_spin_unlock(&rq->lock);
 
 	/* It can have affinity changed while we were choosing. */
-	if (unlikely(!__migrate_task_irq(p, dead_cpu, dest_cpu)))
+	if (needs_cpu)
+		needs_cpu = !__migrate_task(p, dead_cpu, dest_cpu);
+	local_irq_restore(flags);
+
+	if (unlikely(needs_cpu))
 		goto again;
 }
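
(Editorial note: uninitialized_var() was the kernel's annotation for
silencing exactly the class of gcc false positive Peter hit above; in
kernels of that era it was defined, for gcc builds, as a self-assignment:)

#define uninitialized_var(x) x = x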
