Q: select_fallback_rq() && cpuset_lock()

Oleg Nesterov
Mar 9, 2010, 1:10:01 PM

Hello.

I tried to remove the deadlockable cpuset_lock() many times, but my
attempts were ignored by cpuset maintainers ;)

In particular, see http://marc.info/?l=linux-kernel&m=125261083613103

But now I have another question. Since 5da9a0fb673a0ea0a093862f95f6b89b3390c31e
cpuset_cpus_allowed_locked() is called without callback_mutex held by
try_to_wake_up().

And, without callback_mutex held, isn't it possible to race with, say,
update_cpumask() which changes cpuset->cpus_allowed? Yes, update_tasks_cpumask()
should fixup task->cpus_allowed later. But isn't it possible (at least
in theory) that try_to_wake_up() gets, say, all-zeroes in task->cpus_allowed
after select_fallback_rq()->cpuset_cpus_allowed_locked() if we race with
update_cpumask()->cpumask_copy() ?
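
For illustration, a minimal userspace sketch (hypothetical, not kernel code)
of the worry: cpumask_copy() is in effect a word-by-word memcpy, so a
lockless reader can observe a half-updated, momentarily all-zeroes mask:

/* torn_mask.c: deliberate data race showing that a multi-word bitmask
 * copied word by word can be observed momentarily all-zeroes.
 * Build with: gcc -pthread torn_mask.c */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define WORDS 4
static unsigned long mask[WORDS];	/* stands in for task->cpus_allowed */

static void *writer(void *arg)
{
	unsigned long a[WORDS] = { 1UL };		/* old: bit in word 0 */
	unsigned long b[WORDS] = { [WORDS - 1] = 1UL };	/* new: bit in word 3 */

	(void)arg;
	for (;;) {
		memcpy(mask, b, sizeof(mask));	/* like update_cpumask() */
		memcpy(mask, a, sizeof(mask));
	}
	return NULL;
}

int main(void)
{
	pthread_t t;
	unsigned long snap[WORDS];
	int i, empty;

	pthread_create(&t, NULL, writer, NULL);
	for (;;) {
		memcpy(snap, mask, sizeof(snap));	/* lockless reader */
		for (empty = 1, i = 0; i < WORDS; i++)
			if (snap[i])
				empty = 0;
		if (empty) {	/* neither old nor new mask: a torn copy */
			puts("observed an all-zeroes mask");
			return 0;
		}
	}
}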

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Peter Zijlstra
Mar 10, 2010, 11:50:02 AM

On Tue, 2010-03-09 at 19:06 +0100, Oleg Nesterov wrote:
> Hello.
>
> I tried to remove the deadlockable cpuset_lock() many times, but my
> attempts were ignored by cpuset maintainers ;)

Yeah, this appears to be an issue, there's no real maintainer atm, parts
are done by the sched folks, parts by the cgroup folks, and I guess
neither really knows everything..

/me puts it on the to-review stack.

> But now I have another question. Since 5da9a0fb673a0ea0a093862f95f6b89b3390c31e
> cpuset_cpus_allowed_locked() is called without callback_mutex held by
> try_to_wake_up().
>
> And, without callback_mutex held, isn't it possible to race with, say,
> update_cpumask() which changes cpuset->cpus_allowed? Yes, update_tasks_cpumask()
> should fixup task->cpus_allowed later. But isn't it possible (at least
> in theory) that try_to_wake_up() gets, say, all-zeroes in task->cpus_allowed
> after select_fallback_rq()->cpuset_cpus_allowed_locked() if we race with
> update_cpumask()->cpumask_copy() ?

Hurmm,.. good point,.. yes I think that might be possible.
p->cpus_allowed is synchronized properly, but cs->cpus_allowed is not,
bugger.

I guess the quick fix is to really bail and always use cpu_online_mask
in select_fallback_rq().
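
Concretely, something like this in select_fallback_rq()'s last-resort
branch (an untested sketch of the idea, not a tested patch):

	/* No more Mr. Nice Guy. */
	if (dest_cpu >= nr_cpu_ids) {
		/* sketch: skip the racy cpuset lookup entirely and
		 * fall back to any online cpu */
		cpumask_copy(&p->cpus_allowed, cpu_online_mask);
		dest_cpu = cpumask_any(cpu_online_mask);
	}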

Oleg Nesterov
Mar 10, 2010, 12:40:03 PM

On 03/10, Peter Zijlstra wrote:
>
> On Tue, 2010-03-09 at 19:06 +0100, Oleg Nesterov wrote:
> > In particular, see http://marc.info/?l=linux-kernel&m=125261083613103
>
> /me puts it on the to-review stack.

Great, thanks. In fact, you already acked it before ;)

> > But now I have another question. Since 5da9a0fb673a0ea0a093862f95f6b89b3390c31e
> > cpuset_cpus_allowed_locked() is called without callback_mutex held by
> > try_to_wake_up().
> >
> > And, without callback_mutex held, isn't it possible to race with, say,
> > update_cpumask() which changes cpuset->cpus_allowed? Yes, update_tasks_cpumask()
> > should fixup task->cpus_allowed later. But isn't it possible (at least
> > in theory) that try_to_wake_up() gets, say, all-zeroes in task->cpus_allowed
> > after select_fallback_rq()->cpuset_cpus_allowed_locked() if we race with
> > update_cpumask()->cpumask_copy() ?
>
> Hurmm,.. good point,.. yes I think that might be possible.
> p->cpus_allowed is synchronized properly, but cs->cpus_allowed is not,
> bugger.
>
> I guess the quick fix is to really bail and always use cpu_online_mask
> in select_fallback_rq().

Yes, but this breaks cpusets.

Peter, please see the attached mbox with the last discussion and patches.
Of course, the changes in sched.c need the trivial fixups. I'll resend
if you think these changes make sense.

Oleg.

cpuset_lock.m

Peter Zijlstra
Mar 10, 2010, 1:10:01 PM

On Wed, 2010-03-10 at 18:30 +0100, Oleg Nesterov wrote:
> On 03/10, Peter Zijlstra wrote:
> >
> > On Tue, 2010-03-09 at 19:06 +0100, Oleg Nesterov wrote:
> > > In particular, see http://marc.info/?l=linux-kernel&m=125261083613103
> >
> > /me puts it on the to-review stack.
>
> Great, thanks. In fact, you already acked it before ;)
>
> > > But now I have another question. Since 5da9a0fb673a0ea0a093862f95f6b89b3390c31e
> > > cpuset_cpus_allowed_locked() is called without callback_mutex held by
> > > try_to_wake_up().
> > >
> > > And, without callback_mutex held, isn't it possible to race with, say,
> > > update_cpumask() which changes cpuset->cpus_allowed? Yes, update_tasks_cpumask()
> > > should fixup task->cpus_allowed later. But isn't it possible (at least
> > > in theory) that try_to_wake_up() gets, say, all-zeroes in task->cpus_allowed
> > > after select_fallback_rq()->cpuset_cpus_allowed_locked() if we race with
> > > update_cpumask()->cpumask_copy() ?
> >
> > Hurmm,.. good point,.. yes I think that might be possible.
> > p->cpus_allowed is synchronized properly, but cs->cpus_allowed is not,
> > bugger.
> >
> > I guess the quick fix is to really bail and always use cpu_online_mask
> > in select_fallback_rq().
>
> Yes, but this breaks cpusets.

Arguably so, but you can also argue that binding a task to a cpu and
then unplugging that cpu without first fixing up the affinity is a 'you
get to keep both pieces' situation.

> Peter, please see the attached mbox with the last discussion and patches.
> Of course, the changes in sched.c need the trivial fixups. I'll resend
> if you think these changes make sense.

Ah, right.. Damn I must be getting old, there wasn't even a spark of
recollection.

Right, so if you refresh these patches, I'll feed them to mingo and they
should eventually end up in -linus, how's that? :-)

Oleg Nesterov
Mar 10, 2010, 1:40:02 PM

On 03/10, Peter Zijlstra wrote:
>
> Right, so if you refresh these patches, I'll feed them to mingo and they
> should eventually end up in -linus, how's that? :-)

Great. Will redo/resend tomorrow ;)

Oleg.

Oleg Nesterov
Mar 11, 2010, 10:00:02 AM

On 03/10, Oleg Nesterov wrote:
>
> On 03/10, Peter Zijlstra wrote:
> >
> > Right, so if you refresh these patches, I'll feed them to mingo and they
> > should eventually end up in -linus, how's that? :-)
>
> Great. Will redo/resend tomorrow ;)

That was a great plan, but it doesn't work.

With the recent changes we have even more problems with
cpuset_cpus_allowed_locked(). Not only does it miss cpuset_lock() (which
doesn't work anyway and must die imho), it is very wrong to even call
this function from try_to_wake_up() - this can deadlock.

Because task_cs() is protected by task_lock(), which is not irq-safe,
and cpuset_cpus_allowed_locked() takes this lock.
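
Spelled out, the scenario would be something like this (a hypothetical
trace; task_lock() is plain spin_lock(&p->alloc_lock)):

	/* some context takes q's task_lock with irqs still enabled */
	task_lock(q);			// spin_lock(&q->alloc_lock)
	  <hard irq on the same CPU>
	    try_to_wake_up(q)		// q's allowed cpus all offline
	      select_fallback_rq()
	        cpuset_cpus_allowed_locked(q)
	          task_lock(q);		// spins on the lock held below
					// us on this CPU: deadlock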

We need more changes in cpuset.c. Btw, select_fallback_rq() takes
rcu_read_lock around cpuset_cpus_allowed_locked(). Why? I must have
missed something, but afaics this buys nothing.

From the previous email:

On 03/10, Peter Zijlstra wrote:
>

> On Wed, 2010-03-10 at 18:30 +0100, Oleg Nesterov wrote:

> > On 03/10, Peter Zijlstra wrote:
> > >

> > > I guess the quick fix is to really bail and always use cpu_online_mask
> > > in select_fallback_rq().
> >
> > Yes, but this breaks cpusets.
>
> Arguably, so, but you can also argue that binding a task to a cpu and
> then unplugging that cpu without first fixing up the affinity is a 'you
> get to keep both pieces' situation.

Well yes, but still it was supposed that the kernel should handle this case
correctly; the task shouldn't escape its cpuset.

However, currently I don't see another option. I think we should fix the
possible deadlocks and kill cpuset_lock/cpuset_cpus_allowed_locked first,
then try to fix cpusets.

See the trivial (uncompiled) patch below. It just states the fact that
the cpuset_cpus_allowed() logic is broken.

How can we fix this later? Perhaps we can change
cpuset_track_online_cpus(CPU_DEAD) to scan all affected cpusets and
fixup the tasks with the wrong ->cpus_allowed == cpu_possible_mask.

At first glance this should work in try_to_wake_up(p) case, we can't
race with cpuset_change_cpumask()/etc because of TASK_WAKING logic.

But I am not sure how we can fix move_task_off_dead_cpu(). I think
__migrate_task_irq() itself is fine, but if select_fallback_rq() is
called from move_task_off_dead_cpu() nothing protects ->cpus_allowed.
We can race with cpusets, or even with the plain set_cpus_allowed().
Probably nothing really bad can happen, if the resulting cpumask
doesn't have online cpus due to the racing memcpys, we should retry
after __migrate_task_irq() fails. Or we can take cpu_rq(cpu)->lock
around cpumask_copy(p->cpus_allowed, cpu_possible_mask).

sched_exec() seems fine, the task is current and running,
"No more Mr. Nice Guy." case is not possible.

What do you think?

Btw, I think there is a small bug in set_cpus_allowed_ptr(),
wake_up_process(rq->migration_thread) can hit NULL, we should do
wake_up_process(mt).
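
I.e. something along these lines in set_cpus_allowed_ptr() (sketched from
memory and untested), since rq->migration_thread can be cleared once
rq->lock is dropped:

	if (migrate_task(p, cpumask_any_and(cpu_active_mask, new_mask), &req)) {
		struct task_struct *mt = rq->migration_thread;

		get_task_struct(mt);
		task_rq_unlock(rq, &flags);
		wake_up_process(mt);	/* not rq->migration_thread, the
					 * field can be NULLed under us */
		put_task_struct(mt);
		wait_for_completion(&req.done);
		tlb_migrate_finish(p->mm);
		return 0;
	}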

Oleg.

 include/linux/cpuset.h |   10 ----------
 kernel/cpuset.c        |   29 +----------------------------
 kernel/sched.c         |    9 +++------
 3 files changed, 4 insertions(+), 44 deletions(-)

--- x/include/linux/cpuset.h
+++ x/include/linux/cpuset.h
@@ -21,8 +21,6 @@ extern int number_of_cpusets; /* How man
 extern int cpuset_init(void);
 extern void cpuset_init_smp(void);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
-extern void cpuset_cpus_allowed_locked(struct task_struct *p,
-				       struct cpumask *mask);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
 #define cpuset_current_mems_allowed (current->mems_allowed)
 void cpuset_init_current_mems_allowed(void);
@@ -69,9 +67,6 @@ struct seq_file;
 extern void cpuset_task_status_allowed(struct seq_file *m,
 					struct task_struct *task);
 
-extern void cpuset_lock(void);
-extern void cpuset_unlock(void);
-
 extern int cpuset_mem_spread_node(void);
 
 static inline int cpuset_do_page_mem_spread(void)
@@ -105,11 +100,6 @@ static inline void cpuset_cpus_allowed(s
 {
 	cpumask_copy(mask, cpu_possible_mask);
 }
-static inline void cpuset_cpus_allowed_locked(struct task_struct *p,
-					      struct cpumask *mask)
-{
-	cpumask_copy(mask, cpu_possible_mask);
-}
 
 static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
 {
--- x/kernel/cpuset.c
+++ x/kernel/cpuset.c
@@ -2148,7 +2148,7 @@ void cpuset_cpus_allowed(struct task_str
  * cpuset_cpus_allowed_locked - return cpus_allowed mask from a tasks cpuset.
  * Must be called with callback_mutex held.
  **/
-void cpuset_cpus_allowed_locked(struct task_struct *tsk, struct cpumask *pmask)
+static void cpuset_cpus_allowed_locked(struct task_struct *tsk, struct cpumask *pmask)
 {
 	task_lock(tsk);
 	guarantee_online_cpus(task_cs(tsk), pmask);
@@ -2341,33 +2341,6 @@ int __cpuset_node_allowed_hardwall(int n
 }
 
 /**
- * cpuset_lock - lock out any changes to cpuset structures
- *
- * The out of memory (oom) code needs to mutex_lock cpusets
- * from being changed while it scans the tasklist looking for a
- * task in an overlapping cpuset. Expose callback_mutex via this
- * cpuset_lock() routine, so the oom code can lock it, before
- * locking the task list. The tasklist_lock is a spinlock, so
- * must be taken inside callback_mutex.
- */
-
-void cpuset_lock(void)
-{
-	mutex_lock(&callback_mutex);
-}
-
-/**
- * cpuset_unlock - release lock on cpuset changes
- *
- * Undo the lock taken in a previous cpuset_lock() call.
- */
-
-void cpuset_unlock(void)
-{
-	mutex_unlock(&callback_mutex);
-}
-
-/**
  * cpuset_mem_spread_node() - On which node to begin search for a page
  *
  * If a task is marked PF_SPREAD_PAGE or PF_SPREAD_SLAB (as for
--- x/kernel/sched.c
+++ x/kernel/sched.c
@@ -2289,10 +2289,9 @@ static int select_fallback_rq(int cpu, s
 
 	/* No more Mr. Nice Guy. */
 	if (dest_cpu >= nr_cpu_ids) {
-		rcu_read_lock();
-		cpuset_cpus_allowed_locked(p, &p->cpus_allowed);
-		rcu_read_unlock();
-		dest_cpu = cpumask_any_and(cpu_active_mask, &p->cpus_allowed);
+		// XXX: take cpu_rq(cpu)->lock ???
+		cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
+		dest_cpu = cpumask_any(cpu_active_mask);
 
 		/*
 		 * Don't tell them about moving exiting tasks or
@@ -5929,7 +5928,6 @@ migration_call(struct notifier_block *nf
 
 	case CPU_DEAD:
 	case CPU_DEAD_FROZEN:
-		cpuset_lock(); /* around calls to cpuset_cpus_allowed_lock() */
 		migrate_live_tasks(cpu);
 		rq = cpu_rq(cpu);
 		kthread_stop(rq->migration_thread);
@@ -5943,7 +5941,6 @@ migration_call(struct notifier_block *nf
 		rq->idle->sched_class = &idle_sched_class;
 		migrate_dead_tasks(cpu);
 		raw_spin_unlock_irq(&rq->lock);
-		cpuset_unlock();
 		migrate_nr_uninterruptible(rq);
 		BUG_ON(rq->nr_running != 0);
 		calc_global_load_remove(rq);

Oleg Nesterov
Mar 11, 2010, 10:30:02 AM

On 03/11, Oleg Nesterov wrote:
>
> How can we fix this later? Perhaps we can change
> cpuset_track_online_cpus(CPU_DEAD) to scan all affected cpusets and
> fixup the tasks with the wrong ->cpus_allowed == cpu_possible_mask.

Wait. We need to fix the CPU_DEAD case anyway?

Hmm. 6ad4c18884e864cf4c77f9074d3d1816063f99cd
"sched: Fix balance vs hotplug race" did s/CPU_DEAD/CPU_DOWN_PREPARE/
in cpuset_track_online_cpus(). This doesn't look exactly right to me,
we shouldn't do remove_tasks_in_empty_cpuset() at the CPU_DOWN_PREPARE
stage, since the cpu-down can still fail at that point.

Otoh. This means that move_task_off_dead_cpu() can never see a
task without active cpus in ->cpus_allowed; it is called later, by
CPU_DEAD. So, cpuset_lock() is not needed at all.

Oleg.

Peter Zijlstra
Mar 11, 2010, 10:40:02 AM

On Thu, 2010-03-11 at 15:52 +0100, Oleg Nesterov wrote:
> On 03/10, Oleg Nesterov wrote:
> >
> > On 03/10, Peter Zijlstra wrote:
> > >
> > > Right, so if you refresh these patches, I'll feed them to mingo and they
> > > should eventually end up in -linus, how's that? :-)
> >
> > Great. Will redo/resend tomorrow ;)
>
> That was a great plan, but it doesn't work.
>
> With the recent changes we have even more problems with
> cpuset_cpus_allowed_locked(). Not only does it miss cpuset_lock() (which
> doesn't work anyway and must die imho), it is very wrong to even call
> this function from try_to_wake_up() - this can deadlock.
>
> Because task_cs() is protected by task_lock(), which is not irq-safe,
> and cpuset_cpus_allowed_locked() takes this lock.

You're right, and lockdep doesn't normally warn about that because
nobody really hits this path :/

> We need more changes in cpuset.c. Btw, select_fallback_rq() takes
> rcu_read_lock around cpuset_cpus_allowed_locked(). Why? I must have
> missed something, but afaics this buys nothing.

for task_cs() iirc.

> From the previous email:
>
> On 03/10, Peter Zijlstra wrote:
> >
> > On Wed, 2010-03-10 at 18:30 +0100, Oleg Nesterov wrote:
> > > On 03/10, Peter Zijlstra wrote:
> > > >
> > > > I guess the quick fix is to really bail and always use cpu_online_mask
> > > > in select_fallback_rq().
> > >
> > > Yes, but this breaks cpusets.
> >
> > Arguably so, but you can also argue that binding a task to a cpu and
> > then unplugging that cpu without first fixing up the affinity is a 'you
> > get to keep both pieces' situation.
>
> Well yes, but still it was supposed that the kernel should handle this case
> correctly; the task shouldn't escape its cpuset.
>
> However, currently I don't see another option. I think we should fix the
> possible deadlocks and kill cpuset_lock/cpuset_cpus_allowed_locked first,
> then try to fix cpusets.
>
> See the trivial (uncompiled) patch below. It just states the fact that
> the cpuset_cpus_allowed() logic is broken.
>
> How can we fix this later? Perhaps we can change
> cpuset_track_online_cpus(CPU_DEAD) to scan all affected cpusets and
> fixup the tasks with the wrong ->cpus_allowed == cpu_possible_mask.

Problem is, we can't really fix up tasks, wakeup must be able to find a
suitable cpu.

> At first glance this should work in try_to_wake_up(p) case, we can't
> race with cpuset_change_cpumask()/etc because of TASK_WAKING logic.

Well, cs->cpus_possible can still go funny on us.

> But I am not sure how we can fix move_task_off_dead_cpu(). I think
> __migrate_task_irq() itself is fine, but if select_fallback_rq() is
> called from move_task_off_dead_cpu() nothing protects ->cpus_allowed.

It has that retry loop in case the migration fails, right?

> We can race with cpusets, or even with the plain set_cpus_allowed().
> Probably nothing really bad can happen, if the resulting cpumask
> doesn't have online cpus due to the racing memcpys, we should retry
> after __migrate_task_irq() fails. Or we can take cpu_rq(cpu)->lock
> around cpumask_copy(p->cpus_allowed, cpu_possible_mask).

It does the retry thing.

> sched_exec() seems fine, the task is current and running,
> "No more Mr. Nice Guy." case is not possible.
>
> What do you think?
>
> Btw, I think there is a small bug in set_cpus_allowed_ptr(),
> wake_up_process(rq->migration_thread) can hit NULL, we should do
> wake_up_process(mt).

Agreed.

> @@ -2289,10 +2289,9 @@ static int select_fallback_rq(int cpu, s
>
>  	/* No more Mr. Nice Guy. */
>  	if (dest_cpu >= nr_cpu_ids) {
> -		rcu_read_lock();
> -		cpuset_cpus_allowed_locked(p, &p->cpus_allowed);
> -		rcu_read_unlock();
> -		dest_cpu = cpumask_any_and(cpu_active_mask, &p->cpus_allowed);
> +		// XXX: take cpu_rq(cpu)->lock ???
> +		cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
> +		dest_cpu = cpumask_any(cpu_active_mask);


Right, this seems safe.

Peter Zijlstra
Mar 11, 2010, 10:50:02 AM

On Thu, 2010-03-11 at 16:22 +0100, Oleg Nesterov wrote:
> On 03/11, Oleg Nesterov wrote:
> >
> > How can we fix this later? Perhaps we can change
> > cpuset_track_online_cpus(CPU_DEAD) to scan all affected cpusets and
> > fixup the tasks with the wrong ->cpus_allowed == cpu_possible_mask.
>
> Wait. We need to fix the CPU_DEAD case anyway?
>
> Hmm. 6ad4c18884e864cf4c77f9074d3d1816063f99cd
> "sched: Fix balance vs hotplug race" did s/CPU_DEAD/CPU_DOWN_PREPARE/
> in cpuset_track_online_cpus(). This doesn't look exactly right to me,
> we shouldn't do remove_tasks_in_empty_cpuset() at the CPU_DOWN_PREPARE
> stage, since the cpu-down can still fail at that point.

Sure, tough luck for those few tasks.

> Otoh. This means that move_task_off_dead_cpu() can never see a
> task without active cpus in ->cpus_allowed; it is called later, by
> CPU_DEAD. So, cpuset_lock() is not needed at all.

Right,.. so the whole problem is that cpumask ops are terribly expensive
since we got this CONFIG_CPUMASK_OFFSTACK muck, so we try to reduce
these ops in the regular scheduling paths. In the patch you referenced
above the tradeoff was between fixing the sched_domains up too often vs
adding a cpumask_and in a hot-path, guess who won ;-)
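
For readers not steeped in this: with CONFIG_CPUMASK_OFFSTACK=y,
cpumask_var_t becomes a pointer and every temporary mask costs a real
allocation, roughly (illustrative fragment):

	cpumask_var_t tmp;

	if (!alloc_cpumask_var(&tmp, GFP_KERNEL))	/* kmalloc() here */
		return -ENOMEM;
	cpumask_and(tmp, cpu_active_mask, &p->cpus_allowed);
	/* ... use tmp ... */
	free_cpumask_var(tmp);

which is why extra mask ops in the hot scheduling paths are unwelcome.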

Peter Zijlstra
Mar 11, 2010, 11:30:03 AM

On Thu, 2010-03-11 at 17:19 +0100, Oleg Nesterov wrote:

> > > How can we fix this later? Perhaps we can change
> > > cpuset_track_online_cpus(CPU_DEAD) to scan all affected cpusets and
> > > fixup the tasks with the wrong ->cpus_allowed == cpu_possible_mask.
> >
> > Problem is, we can't really fix up tasks, wakeup must be able to find a
> > suitable cpu.
>
> Yes sure. I meant, wakeup()->select_fallback_rq() sets cpus_allowed =
> cpu_possible_map as we discussed. Then cpuset_track_online_cpus(CPU_DEAD)
> fixes the affected tasks.

Ah, have that re-validate the p->cpus_allowed for all cpuset tasks, ok
that might work.

> > > At first glance this should work in try_to_wake_up(p) case, we can't
> > > race with cpuset_change_cpumask()/etc because of TASK_WAKING logic.
> >
> > Well, cs->cpus_possible can still go funny on us.
>
> What do you mean? Afaics, cpusets always uses set_cpus_allowed() to
> change task->cpus_allowed.

Confusion^2 ;-), I failed to grasp your fixup idea and got confused,
which confused you.

> > > But I am not sure how we can fix move_task_off_dead_cpu(). I think
> > > __migrate_task_irq() itself is fine, but if select_fallback_rq() is
> > > called from move_task_off_dead_cpu() nothing protects ->cpus_allowed.
> >
> > It has that retry loop in case the migration fails, right?
> >
> > > We can race with cpusets, or even with the plain set_cpus_allowed().
> > > Probably nothing really bad can happen, if the resulting cpumask
> > > doesn't have online cpus due to the racing memcpys, we should retry
> > > after __migrate_task_irq() fails. Or we can take cpu_rq(cpu)->lock
> > > around cpumask_copy(p->cpus_allowed, cpu_possible_mask).
> >
> > It does the retry thing.
>
> Yes, I mentioned retry logic too. But it can't always help, even without
> cpusets.
>
> Suppose a task T is bound to the dead CPU, and move_task_off_dead_cpu()
> races with set_cpus_allowed(new_mask). I think it is fine if T gets
> either new_mask or cpu_possible_map in ->cpus_allowed. But, it can get
> a "random" mix if 2 memcpy() run in parallel. And it is possible that
> __migrate_task_irq() will not fail if dest_cpu falls into resulting mask.

Ah indeed. One would almost construct a cpumask_assign that uses RCU
atomic pointer assignment for all this stupid cpumask juggling :/
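
The hypothetical cpumask_assign() could have roughly this shape (no such
helper exists; this is just the idea, untested):

	/* writers publish a fresh mask by pointer swap; lockless readers
	 * then see a consistent old-or-new mask, never a torn mix */
	static int cpumask_assign(struct cpumask **maskp, const struct cpumask *new)
	{
		struct cpumask *copy, *old;

		copy = kmalloc(cpumask_size(), GFP_KERNEL);
		if (!copy)
			return -ENOMEM;
		cpumask_copy(copy, new);

		old = *maskp;			/* caller serializes writers */
		rcu_assign_pointer(*maskp, copy);
		synchronize_rcu();		/* wait out readers of 'old' */
		kfree(old);
		return 0;
	}

with readers doing rcu_read_lock()/rcu_dereference() instead of reading
the mask in place.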

> > > @@ -2289,10 +2289,9 @@ static int select_fallback_rq(int cpu, s
> > >
> > >  	/* No more Mr. Nice Guy. */
> > >  	if (dest_cpu >= nr_cpu_ids) {
> > > -		rcu_read_lock();
> > > -		cpuset_cpus_allowed_locked(p, &p->cpus_allowed);
> > > -		rcu_read_unlock();
> > > -		dest_cpu = cpumask_any_and(cpu_active_mask, &p->cpus_allowed);
> > > +		// XXX: take cpu_rq(cpu)->lock ???
> > > +		cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
> > > +		dest_cpu = cpumask_any(cpu_active_mask);
> >
> >
> > Right, this seems safe.
>
> OK, I'll try to read this code a bit more and then send this patch.

Thanks!

Oleg Nesterov
Mar 11, 2010, 11:30:03 AM

On 03/11, Peter Zijlstra wrote:
>
> On Thu, 2010-03-11 at 15:52 +0100, Oleg Nesterov wrote:
>
> > Btw, select_fallback_rq() takes
> > rcu_read_lock around cpuset_cpus_allowed_locked(). Why? I must have
> > missed something, but afaics this buys nothing.
>
> for task_cs() iirc.

it should be stable under task_lock()... Never mind.

> > How can we fix this later? Perhaps we can change
> > cpuset_track_online_cpus(CPU_DEAD) to scan all affected cpusets and
> > fixup the tasks with the wrong ->cpus_allowed == cpu_possible_mask.
>
> Problem is, we can't really fix up tasks, wakeup must be able to find a
> suitable cpu.

Yes sure. I meant, wakeup()->select_fallback_rq() sets cpus_allowed =
cpu_possible_map as we discussed. Then cpuset_track_online_cpus(CPU_DEAD)
fixes the affected tasks.

> > At first glance this should work in try_to_wake_up(p) case, we can't
> > race with cpuset_change_cpumask()/etc because of TASK_WAKING logic.
>
> Well, cs->cpus_possible can still go funny on us.

What do you mean? Afaics, cpusets always uses set_cpus_allowed() to
change task->cpus_allowed.

> > But I am not sure how we can fix move_task_off_dead_cpu(). I think
> > __migrate_task_irq() itself is fine, but if select_fallback_rq() is
> > called from move_task_off_dead_cpu() nothing protects ->cpus_allowed.
>
> It has that retry loop in case the migration fails, right?
>
> > We can race with cpusets, or even with the plain set_cpus_allowed().
> > Probably nothing really bad can happen, if the resulting cpumask
> > doesn't have online cpus due to the racing memcpys, we should retry
> > after __migrate_task_irq() fails. Or we can take cpu_rq(cpu)->lock
> > around cpumask_copy(p->cpus_allowed, cpu_possible_mask).
>
> It does the retry thing.

Yes, I mentioned retry logic too. But it can't always help, even without
cpusets.

Suppose a task T is bound to the dead CPU, and move_task_off_dead_cpu()
races with set_cpus_allowed(new_mask). I think it is fine if T gets
either new_mask or cpu_possible_map in ->cpus_allowed. But, it can get
a "random" mix if 2 memcpy() run in parallel. And it is possible that
__migrate_task_irq() will not fail if dest_cpu falls into resulting mask.

> > @@ -2289,10 +2289,9 @@ static int select_fallback_rq(int cpu, s
> >
> >  	/* No more Mr. Nice Guy. */
> >  	if (dest_cpu >= nr_cpu_ids) {
> > -		rcu_read_lock();
> > -		cpuset_cpus_allowed_locked(p, &p->cpus_allowed);
> > -		rcu_read_unlock();
> > -		dest_cpu = cpumask_any_and(cpu_active_mask, &p->cpus_allowed);
> > +		// XXX: take cpu_rq(cpu)->lock ???
> > +		cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
> > +		dest_cpu = cpumask_any(cpu_active_mask);
>
>
> Right, this seems safe.

OK, I'll try to read this code a bit more and then send this patch.

Oleg.

Oleg Nesterov
Mar 13, 2010, 2:40:01 PM

On 03/11, Oleg Nesterov wrote:
>
> > > @@ -2289,10 +2289,9 @@ static int select_fallback_rq(int cpu, s
> > >
> > >  	/* No more Mr. Nice Guy. */
> > >  	if (dest_cpu >= nr_cpu_ids) {
> > > -		rcu_read_lock();
> > > -		cpuset_cpus_allowed_locked(p, &p->cpus_allowed);
> > > -		rcu_read_unlock();
> > > -		dest_cpu = cpumask_any_and(cpu_active_mask, &p->cpus_allowed);
> > > +		// XXX: take cpu_rq(cpu)->lock ???
> > > +		cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
> > > +		dest_cpu = cpumask_any(cpu_active_mask);
> >
> >
> > Right, this seems safe.
>
> OK, I'll try to read this code a bit more and then send this patch.

No, it is not safe :/

Peter, I seem to see a simple fix for the discussed cpuset problems,
but it turns out sched.c has more problems even without cpusets.

I'll try to send the whole series on Monday, but perhaps you can look
at the attached compile-tested patches (especially 2 and 3) to quickly
correct me if the 3rd one is wrong.

The subsequent fix in cpuset.c depends on the locking rules enforced
by 2-3.

Oleg.

1_KILL_CPUSET_LOCK.patch
2_MTODC_TAKE_RQ_LOCK.patch
3_SCHED_EXEC_DONT_FALLBACK.patch
4_CPU_DOWN_AFFINITY.patch

Peter Zijlstra
Mar 13, 2010, 9:20:01 PM

On Sat, 2010-03-13 at 20:28 +0100, Oleg Nesterov wrote:
>
> Peter, I seem to see a simple fix for the discussed cpuset problems,
> but it turns out sched.c has more problems even without cpusets.
>
> I'll try to send the whole series on Monday, but perhaps you can look
> at the attached compile-tested patches (especially 2 and 3) to quickly
> correct me if the 3rd one is wrong.
>
> The subsequent fix in cpuset.c depends on the locking rules enforced
> by 2-3.

As far as I can trust my jet-lagged brain, they look good. I'll try
and have another look later.
