[BUG 2.6.25-rc3] scheduler/hotplug: some processes are dealocked when cpu is set to offline

Yi Yang

unread,

Mar 3, 2008, 1:30:14 AM3/3/08

to

When i run cpu hotplug stress test on a system with 16 logical cpus,
some processes are deadlocked, it can be regenerated on 2.6.25-rc2
and 2.6.25-rc3.

The kernel thread [watchdog/1] isn't reaped by its parent when cpu #1
is set to offline, so that [kstopmachine] sleeps for ever.

After i investigated, debugged and analyzed it, i think it is a scheduler
specific issue, i noticed someone submitted a patch which mentioned
kthread_should_stop calling rule:

"In order to safely use kthread_stop() for kthread, there is a requirement
on how its main loop has to be orginized. Namely, the sequence of
events that lead to kthread being blocked (schedule()) has to be
ordered as follows:

set_current_state(TASK_INTERRUPTIBLE);
if (kthread_should_stop()) break;
schedule() or similar.

set_current_state() implies a full memory barrier. kthread_stop()
has a matching barrier right after an update of kthread_stop_info.k
and before kthread's wakeup."

I don't think it is a good programming paradigm because it will result
in some issues very easily, why can't we provide a simple wrapper for it?

This issue seems such one, but i tried to change it to follow this rule but
the issue is still there.

Why isn't the kernel thread [watchdog/1] reaped by its parent? its state
is TASK_RUNNING with high priority (R< means this), why it isn't done?

Anyone ever met such a problem? Your thought?

Here is my stress test script, you may try it on SMP platforms (the more
number of cpu is , the easier it can be regenerated.).

########################################################################

####### stresscpus.sh #######

#!/bin/sh

for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
do
./bugexposure.sh $i &
done

######## bugexposure.sh ###########

#!/bin/bash

main(){

typeset -i CPU=$1
./task.sh > /dev/null&
PID=$!

if [ `cat /sys/devices/system/cpu/cpu${CPU}/online` = "0" ]; then
echo "1" > /sys/devices/system/cpu/cpu${CPU}/online
fi

MASK=$((1<<${CPU}))

`taskset -p ${MASK} ${PID} > /dev/null 2>&1`

echo "0" > /sys/devices/system/cpu/cpu${CPU}/online

echo "1" > /sys/devices/system/cpu/cpu${CPU}/online

disown $PID
kill -9 $PID > /dev/null 2>&1

#echo "PASS\n"

}

typeset -i TEST_CPU=$1
while true
do
main $TEST_CPU
done

###### task.sh ######
#!/bin/bash

while :
do
NOOP=1
done

########################################################################

My tracing and analysis info is below:

# ps aux | grep watchdog
root 5 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/0]
root 8 0.0 0.0 0 0 ? R< 22:26 0:00 [watchdog/1]
root 11 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/2]
root 14 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/3]
root 17 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/4]
root 20 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/5]
root 23 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/6]
root 26 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/7]
root 29 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/8]
root 32 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/9]
root 35 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/10]
root 38 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/11]
root 41 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/12]
root 44 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/13]
root 50 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/15]
root 5641 0.0 0.0 61144 708 pts/0 R+ 22:58 0:00 grep watchdog
#

# ps aux | grep ksoftirqd
root 4 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/0]
root 10 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/2]
root 13 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/3]
root 16 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/4]
root 19 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/5]
root 22 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/6]
root 25 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/7]
root 28 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/8]
root 31 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/9]
root 34 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/10]
root 37 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/11]
root 40 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/12]
root 43 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/13]
root 49 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/15]
root 5647 0.0 0.0 61144 712 pts/0 R+ 23:00 0:00 grep ksoftirqd
#

#ps aux
...
root 5554 0.0 0.0 0 0 ? S< 22:42 0:00 [kstopmachine]
root 5556 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct>
root 5557 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct>
root 5558 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct>
root 5559 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct>
root 5560 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct>
root 5561 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct>
root 5562 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct>
root 5563 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct>
root 5564 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct>
root 5565 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct>
root 5566 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct>
root 5567 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct>
root 5568 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct>
root 5569 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct>
...

[watchdog/1] isn't terminated so that [kstopmachine] will waiting for it
for ever, so that "echo 0 > /sys/devices/syste/cpu/cpu1/online" will hang
up, so that hotplug spinlock will be holden by it, so that other spinlock
contenter will wait for locking it for ever, so that...

These are some softlock detection output:

process 5471 (task.sh) no longer affine to cpu1
INFO: task group_balance:51 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
group_balance D ffff810001011580 0 51 2
ffff81027e97fdc0 0000000000000046 ffff81027e97fdd0 ffffffff80472b15
0000000000000000 ffff81027e97d280 ffff81027c8bd300 ffff81027e97d5d0
0000000080569da0 ffff81027e97d5d0 000000007e97d280 000000010009d366
Call Trace:
[<ffffffff80472b15>] thread_return+0x3d/0xa2
[<ffffffff8023cce5>] lock_timer_base+0x26/0x4b
[<ffffffff8047338b>] __mutex_lock_slowpath+0x5c/0x93
[<ffffffff80473221>] mutex_lock+0x1a/0x1e
[<ffffffff8024f8fd>] get_online_cpus+0x27/0x3e
[<ffffffff8022fd04>] load_balance_monitor+0x60/0x375
[<ffffffff8022fca4>] load_balance_monitor+0x0/0x375
[<ffffffff80245ff6>] kthread+0x47/0x75
[<ffffffff802301f0>] schedule_tail+0x28/0x5c
[<ffffffff8020cc78>] child_rip+0xa/0x12
[<ffffffff80245faf>] kthread+0x0/0x75
[<ffffffff8020cc6e>] child_rip+0x0/0x12

INFO: task bugexposure.sh:5462 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bugexposure.s D ffff8100010e1580 0 5462 1
ffff81026a981c58 0000000000000082 ffff81026a981c68 ffffffff80472b15
00000000000004d6 ffff81027c83c920 ffff81027e8bd680 ffff81027c83cc70
0000000d7e8bc4c0 ffff81027c83cc70 0000000d00000002 000000010009d809
Call Trace:
[<ffffffff80472b15>] thread_return+0x3d/0xa2
[<ffffffff80314b25>] __next_cpu+0x19/0x28
[<ffffffff8022a5d4>] enqueue_rt_entity+0x4e/0xb1
[<ffffffff80472f34>] schedule_timeout+0x1e/0xad
[<ffffffff802280bf>] enqueue_task+0x4d/0x58
[<ffffffff80472ccf>] wait_for_common+0xd5/0x118
[<ffffffff8022c31b>] default_wake_function+0x0/0xe
[<ffffffff80245c93>] kthread_stop+0x58/0x79
[<ffffffff8046fdcf>] cpu_callback+0x189/0x197
[<ffffffff8046f59e>] cpu_callback+0x1cf/0x274
[<ffffffff8026c07e>] writeback_set_ratelimit+0x19/0x68
[<ffffffff8047658b>] notifier_call_chain+0x29/0x4c
[<ffffffff8024f74d>] _cpu_down+0x1d6/0x2aa
[<ffffffff8024f845>] cpu_down+0x24/0x31
[<ffffffff8038fcf4>] store_online+0x29/0x67
[<ffffffff802d23e1>] sysfs_write_file+0xd2/0x110
[<ffffffff8028db49>] vfs_write+0xad/0x136
[<ffffffff8028e086>] sys_write+0x45/0x6e
[<ffffffff8020bfd9>] tracesys+0xdc/0xe1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Ingo Molnar

unread,

Mar 3, 2008, 7:00:11 AM3/3/08

to

* Dmitry Adamushko <dmitry.a...@gmail.com> wrote:

> per_cpu(watchdog_task, hotcpu) = NULL;
> + mlseep(1);

that wont build very well ...

Ingo

Dmitry Adamushko

unread,

Mar 3, 2008, 7:00:11 AM3/3/08

to

On 02/03/2008, Yi Yang <yi.y...@intel.com> wrote:
> [ ... ]

>
> Why isn't the kernel thread [watchdog/1] reaped by its parent?
> its state
> is TASK_RUNNING with high priority (R< means this), why it isn't done?
>
> Anyone ever met such a problem? Your thought?
>

iirc, Andrew had the same issue.

'watchdog's are supposed to be stopped with kthread_stop() from
softlockup.c :: cpu_callback() :: case CPU_DEAD.

The 'R' state of 'watchdog' is strange indeed.

would you please conduct a test with the patch [1] below and provide
me with additional output it generates?

(non-white-space-damaged versions are enclosed)

with your current set-up and also with
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs"

Then please try also the patch [2] (on top of [1]). With this 'magic'
patch applied, the issue 'seemed' to disappear on Andrew's set-up...

Thanks in advacne,

[1]

--- softlockup-prev.c 2008-03-03 12:35:01.000000000 +0100
+++ softlockup.c 2008-03-03 12:38:01.000000000 +0100
@@ -237,6 +237,8 @@ static int watchdog(void *__bind_cpu)

}

+ printk(KERN_WARN "-> watchdog(cpu: %d) is done\n", this_cpu);
+
return 0;
}

@@ -249,6 +251,9 @@ cpu_callback(struct notifier_block *nfb,
int hotcpu = (unsigned long)hcpu;
struct task_struct *p;

+ printk(KERN_WARN "-> cpu_callback(cpu: %d, action: %lu,
check_cpu: %d)\n",
+ hotcpu, action, check_cpu);
+
switch (action) {
case CPU_UP_PREPARE:
case CPU_UP_PREPARE_FROZEN:

[2]

--- softlockup-prev-2.c 2008-03-03 12:38:36.000000000 +0100
+++ softlockup.c 2008-03-03 12:39:02.000000000 +0100
@@ -294,6 +294,7 @@ cpu_callback(struct notifier_block *nfb,
case CPU_DEAD_FROZEN:
p = per_cpu(watchdog_task, hotcpu);

per_cpu(watchdog_task, hotcpu) = NULL;
+ mlseep(1);

kthread_stop(p);
break;
#endif /* CONFIG_HOTPLUG_CPU */

--
Best regards,
Dmitry Adamushko

softlockup-debug-1.patch

softlockup-debug-2.patch

Dmitry Adamushko

unread,

Mar 3, 2008, 7:10:10 AM3/3/08

to

On 03/03/2008, Ingo Molnar <mi...@elte.hu> wrote:
>
> * Dmitry Adamushko <dmitry.a...@gmail.com> wrote:
>
> > per_cpu(watchdog_task, hotcpu) = NULL;
> > + mlseep(1);
>
>
> that wont build very well ...

yeah, I forgot to mention that it's not even compile-tested :-/
I re-created it from scratch instead of looking for the original one.

please, this one (again, not compile-tested)

--- softlockup-prev-2.c 2008-03-03 12:38:36.000000000 +0100

+++ softlockup.c 2008-03-03 13:00:20.000000000 +0100

@@ -294,6 +294,7 @@ cpu_callback(struct notifier_block *nfb,
case CPU_DEAD_FROZEN:
p = per_cpu(watchdog_task, hotcpu);

per_cpu(watchdog_task, hotcpu) = NULL;
+ msleep(1);

softlockup-debug-2.patch

Gautham R Shenoy

unread,

Mar 3, 2008, 10:40:15 AM3/3/08

to

Hi Yi,

This is indeed strange. I am able to reproduce this problem on my 4-way
box. From what I see in the past two runs, we're waiting in the
cpu-hotplug callback path for the watchdog/1 thread to stop.

During cpu-offline, once the cpu goes offline, in the migration_call(),
we migrate any tasks associated with the offline cpus
to some other cpu. This also mean breaking affinity for tasks which were
affined to the cpu which went down. So watchdog/1 has been migrated to
some other cpu.

However, it remains in R< state and has not executed the
kthread_should_stop() instruction.

I'm trying to probe further by inserting a few more printk's in there.

Will post the findings in a couple of hours.

Thanks for reporting the problem.

Regards
gautham.

Yi Yang

unread,

Mar 3, 2008, 9:40:11 PM3/3/08

to

On Mon, 2008-03-03 at 21:01 +0530, Gautham R Shenoy wrote:
> > This issue seems such one, but i tried to change it to follow this rule but
> > the issue is still there.
> >
> > Why isn't the kernel thread [watchdog/1] reaped by its parent? its state
> > is TASK_RUNNING with high priority (R< means this), why it isn't done?
> >
> > Anyone ever met such a problem? Your thought?
>
> Hi Yi,
>
> This is indeed strange. I am able to reproduce this problem on my 4-way
> box. From what I see in the past two runs, we're waiting in the
> cpu-hotplug callback path for the watchdog/1 thread to stop.
>
> During cpu-offline, once the cpu goes offline, in the migration_call(),
> we migrate any tasks associated with the offline cpus
> to some other cpu. This also mean breaking affinity for tasks which were
> affined to the cpu which went down. So watchdog/1 has been migrated to
> some other cpu.

No, [watchdog/1] is just for CPU #1, if CPU #1 has been offline, it
should be killed but not migrated to other CPU because other CPU has
such a kthread.

Maybe migration_call was doing such a bad thing. :-)

>
> However, it remains in R< state and has not executed the
> kthread_should_stop() instruction.
>
> I'm trying to probe further by inserting a few more printk's in there.
>
> Will post the findings in a couple of hours.
>
> Thanks for reporting the problem.
>
> Regards
> gautham.

Yi Yang

unread,

Mar 3, 2008, 9:50:07 PM3/3/08

to

I don't think it can fix this issue, it only gives one chance to
scheduler, i think there are another potential and very serious issues
inside of scheduler or locking or what else we don't know.

Maybe migration is a doubtful point as Gautham mentioned.

Gautham R Shenoy

unread,

Mar 4, 2008, 12:30:12 AM3/4/08

to

On Mon, Mar 03, 2008 at 10:45:04PM +0800, Yi Yang wrote:
> On Mon, 2008-03-03 at 21:01 +0530, Gautham R Shenoy wrote:
> > > This issue seems such one, but i tried to change it to follow this rule but
> > > the issue is still there.
> > >
> > > Why isn't the kernel thread [watchdog/1] reaped by its parent? its state
> > > is TASK_RUNNING with high priority (R< means this), why it isn't done?
> > >
> > > Anyone ever met such a problem? Your thought?
> >
> > Hi Yi,
> >
> > This is indeed strange. I am able to reproduce this problem on my 4-way
> > box. From what I see in the past two runs, we're waiting in the
> > cpu-hotplug callback path for the watchdog/1 thread to stop.
> >
> > During cpu-offline, once the cpu goes offline, in the migration_call(),
> > we migrate any tasks associated with the offline cpus
> > to some other cpu. This also mean breaking affinity for tasks which were
> > affined to the cpu which went down. So watchdog/1 has been migrated to
> > some other cpu.
> No, [watchdog/1] is just for CPU #1, if CPU #1 has been offline, it
> should be killed but not migrated to other CPU because other CPU has
> such a kthread.

Yes, it is killed once it gets a chance to run *after* cpu goes offline.
The moment it runs on some other cpu, it will see the kthread_should_stop()
because in the cpu-hotplug callback path we've issues a
kthread_stop(watchdog/1)

Again, we can argue that we could issue a kthread_stop()
in CPU_DOWN_PREPARE, rather than in CPU_DEAD and restart
it in CPU_DOWN_FAILED if the cpu-hotplug operation does fail.

>
> Maybe migration_call was doing such a bad thing. :-)

Nope, from what I see migration call is not having any problems. It is
behaving the way it is supposed to behave :)

The other observation I noted was the WARN_ON_ONCE() in hrtick() [1]
that I am consistently hitting after the first cpu goes offline.

So at times, the callback thread is blocked on kthread_stop(k) in
softlockup.c, while other time, it was blocked in
cleanup_workqueue_threads() in workqueue.c.

This was with the debug patch[2]

Not sure if this is linked to the problem that Yi has pointed out
but looks like a regression. I'll see if this can be reproduced on
2.6.24, 2.6.25-rc1 and 2.6.25-rc2.

[1] The WARN_ON_ONCE() trace.

------------[ cut here ]------------
WARNING: at kernel/sched.c:1007 hrtick+0x32/0x6a()
Modules linked in: dock
Pid: 4451, comm: bash Not tainted 2.6.25-rc3 #26
[<c011f6c8>] warn_on_slowpath+0x41/0x51
[<c013a0dd>] ? trace_hardirqs_on+0xd3/0x111
[<c04e43a3>] ? _spin_unlock_irqrestore+0x42/0x58
[<c02767e7>] ? blk_run_queue+0x64/0x68
[<c033ae6e>] ? scsi_run_queue+0x18d/0x195
[<c027fd7b>] ? kobject_put+0x14/0x16
[<c02e1c3f>] ? put_device+0x11/0x13
[<c013af7b>] ? __lock_acquire+0xaae/0xaf6
[<c01320bf>] ? __run_hrtimer+0x35/0x70
[<c0119a8a>] hrtick+0x32/0x6a
[<c0119a58>] ? hrtick+0x0/0x6a
[<c01320c3>] __run_hrtimer+0x39/0x70
[<c01328e8>] hrtimer_interrupt+0xed/0x156
[<c0112db9>] smp_apic_timer_interrupt+0x6c/0x7f
[<c010568b>] apic_timer_interrupt+0x33/0x38
[<c01202ed>] ? vprintk+0x2d0/0x328
[<c027fe47>] ? kobject_release+0x4b/0x50
[<c027fdfc>] ? kobject_release+0x0/0x50
[<c04dea24>] ? cpuid_class_cpu_callback+0x0/0x50
[<c0280931>] ? kref_put+0x39/0x44
[<c027fd7b>] ? kobject_put+0x14/0x16
[<c02e1c3f>] ? put_device+0x11/0x13
[<c014e2e3>] ? cpu_swap_callback+0x0/0x3d
[<c012035a>] printk+0x15/0x17
[<c04e61d0>] notifier_call_chain+0x40/0x9b
[<c04e2d25>] ? mutex_unlock+0x8/0xa
[<c0143c29>] ? __stop_machine_run+0x8c/0x95
[<c013e1b3>] ? take_cpu_down+0x0/0x27
[<c01331e8>] __raw_notifier_call_chain+0xe/0x10
[<c01331f6>] raw_notifier_call_chain+0xc/0xe
[<c013e37e>] _cpu_down+0x1a4/0x269
[<c013e466>] cpu_down+0x23/0x30
[<c02e58e7>] store_online+0x27/0x5a
[<c02e58c0>] ? store_online+0x0/0x5a
[<c02e2a9c>] sysdev_store+0x20/0x25
[<c0197e65>] sysfs_write_file+0xad/0xdf
[<c0197db8>] ? sysfs_write_file+0x0/0xdf
[<c0165099>] vfs_write+0x8c/0x108
[<c0165623>] sys_write+0x3b/0x60
[<c0104b12>] sysenter_past_esp+0x5f/0xa5
=======================
---[ end trace 22cbd9e369049151 ]---

[2] The debug patch
----->

Index: linux-2.6.25-rc3/kernel/cpu.c
===================================================================
--- linux-2.6.25-rc3.orig/kernel/cpu.c
+++ linux-2.6.25-rc3/kernel/cpu.c
@@ -18,7 +18,7 @@
/* Serializes the updates to cpu_online_map, cpu_present_map */
static DEFINE_MUTEX(cpu_add_remove_lock);

-static __cpuinitdata RAW_NOTIFIER_HEAD(cpu_chain);
+__cpuinitdata RAW_NOTIFIER_HEAD(cpu_chain);

/* If set, cpu_up and cpu_down will return -EBUSY and do nothing.
* Should always be manipulated under cpu_add_remove_lock
@@ -207,11 +207,14 @@ static int _cpu_down(unsigned int cpu, i
if (!cpu_online(cpu))
return -EINVAL;

+ printk("[HOTPLUG] calling cpu_hotplug_begin\n");
cpu_hotplug_begin();
+ printk("[HOTPLUG] calling CPU_DOWN_PREPARE\n");
err = __raw_notifier_call_chain(&cpu_chain, CPU_DOWN_PREPARE | mod,
hcpu, -1, &nr_calls);
if (err == NOTIFY_BAD) {
nr_calls--;
+ printk("[HOTPLUG] calling CPU_DOWN_FAILED\n");
__raw_notifier_call_chain(&cpu_chain, CPU_DOWN_FAILED | mod,
hcpu, nr_calls, NULL);
printk("%s: attempt to take down CPU %u failed\n",
@@ -226,10 +229,12 @@ static int _cpu_down(unsigned int cpu, i
cpu_clear(cpu, tmp);
set_cpus_allowed(current, tmp);

+ printk("[HOTPLUG] calling stop_machine_run()\n");
p = __stop_machine_run(take_cpu_down, &tcd_param, cpu);

if (IS_ERR(p) || cpu_online(cpu)) {
/* CPU didn't die: tell everyone. Can't complain. */
+ printk("[HOTPLUG] calling CPU_DOWN_FAILED\n");
if (raw_notifier_call_chain(&cpu_chain, CPU_DOWN_FAILED | mod,
hcpu) == NOTIFY_BAD)
BUG();
@@ -241,13 +246,16 @@ static int _cpu_down(unsigned int cpu, i
goto out_thread;
}

+ printk("[HOTPLUG] waiting for idle_cpu()\n");
/* Wait for it to sleep (leaving idle task). */
while (!idle_cpu(cpu))
yield();

+ printk("[HOTPLUG] calling __cpu_die()\n");
/* This actually kills the CPU. */
__cpu_die(cpu);

+ printk("[HOTPLUG] calling CPU_DEAD\n");
/* CPU is completely dead: tell everyone. Too late to complain. */
if (raw_notifier_call_chain(&cpu_chain, CPU_DEAD | mod,
hcpu) == NOTIFY_BAD)
@@ -256,11 +264,14 @@ static int _cpu_down(unsigned int cpu, i
check_for_tasks(cpu);

out_thread:
+ printk("[HOTPLUG] calling kthread_stop_machine\n");
err = kthread_stop(p);
out_allowed:
set_cpus_allowed(current, old_allowed);
out_release:
+ printk("[HOTPLUG] calling cpu_hotplug_done()\n");
cpu_hotplug_done();
+ printk("[HOTPLUG] returning from _cpu_down()\n");
return err;
}

Index: linux-2.6.25-rc3/kernel/notifier.c
===================================================================
--- linux-2.6.25-rc3.orig/kernel/notifier.c
+++ linux-2.6.25-rc3/kernel/notifier.c
@@ -5,7 +5,7 @@
#include <linux/rcupdate.h>
#include <linux/vmalloc.h>
#include <linux/reboot.h>
-
+#include <linux/kallsyms.h>
/*
* Notifier list for kernel code which wants to be called
* at shutdown. This is used to stop any idling DMA operations
@@ -44,6 +44,7 @@ static int notifier_chain_unregister(str
return -ENOENT;
}

+extern struct raw_notifier_head cpu_chain;
/**
* notifier_call_chain - Informs the registered notifiers about an event.
* @nl: Pointer to head of the blocking notifier chain
@@ -62,12 +63,21 @@ static int __kprobes notifier_call_chain
{
int ret = NOTIFY_DONE;
struct notifier_block *nb, *next_nb;
+ char name_buf[100];

nb = rcu_dereference(*nl);

while (nb && nr_to_call) {
next_nb = rcu_dereference(nb->next);
+ if (nl == &cpu_chain.head) {
+ sprint_symbol(name_buf, (unsigned long)nb->notifier_call);
+ printk("[HOTPLUG] calling callback:%s\n", name_buf);
+ }
ret = nb->notifier_call(nb, val, v);
+ if (nl == &cpu_chain.head) {
+ sprint_symbol(name_buf, (unsigned long)nb->notifier_call);
+ printk("[HOTPLUG] returned from callback:%s\n", name_buf);
+ }

if (nr_calls)
(*nr_calls)++;

> >
> > However, it remains in R< state and has not executed the
> > kthread_should_stop() instruction.
> >
> > I'm trying to probe further by inserting a few more printk's in there.
> >
> > Will post the findings in a couple of hours.
> >
> > Thanks for reporting the problem.
> >
> > Regards
> > gautham.
>

--
Thanks and Regards
gautham

Yi Yang

unread,

Mar 4, 2008, 12:30:11 AM3/4/08

to

On Mon, 2008-03-03 at 22:53 +0800, Yi Yang wrote:
> On Mon, 2008-03-03 at 13:02 +0100, Dmitry Adamushko wrote:
> > On 03/03/2008, Ingo Molnar <mi...@elte.hu> wrote:
> > >
> > > * Dmitry Adamushko <dmitry.a...@gmail.com> wrote:
> > >
> > > > per_cpu(watchdog_task, hotcpu) = NULL;
> > > > + mlseep(1);
> > >
> > >
> > > that wont build very well ...
> >
> > yeah, I forgot to mention that it's not even compile-tested :-/
> > I re-created it from scratch instead of looking for the original one.
> >
> > please, this one (again, not compile-tested)
> >
> > --- softlockup-prev-2.c 2008-03-03 12:38:36.000000000 +0100
> > +++ softlockup.c 2008-03-03 13:00:20.000000000 +0100
> > @@ -294,6 +294,7 @@ cpu_callback(struct notifier_block *nfb,
> > case CPU_DEAD_FROZEN:
> > p = per_cpu(watchdog_task, hotcpu);
> > per_cpu(watchdog_task, hotcpu) = NULL;
> > + msleep(1);
> > kthread_stop(p);
> > break;
> > #endif /* CONFIG_HOTPLUG_CPU */
>
> I don't think it can fix this issue, it only gives one chance to
> scheduler, i think there are another potential and very serious issues
> inside of scheduler or locking or what else we don't know.
>
> Maybe migration is a doubtful point as Gautham mentioned.

That issue is still there after the above patch is applied.

I found that [watchdog/#] is indeed migrated to other cpu because
migration_call is called before cpu_callback, i think this is the real
root cause very very possibly.

I suggest we can develop a new notifier infrastructure in which one
caller can specify whether it is kthread_stopping a cpu-bind kthread
so that such notifier callbacks can be executed prior to other
callbacks.

Gautham R Shenoy

unread,

Mar 4, 2008, 4:20:10 AM3/4/08

to

This is the hung_task_timeout message after a couple of cpu-offlines.

This is on 2.6.25-rc3.

INFO: task bash:4467 blocked for more than 120 seconds.

"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

f3701dd0 00000046 f796aac0 f796aac0 f796abf8 cc434b80 00000000 f41ee940
0180b046 0000026e 00000016 00000000 00000008 f796b080 f796aac0 00000002
7fffffff 7fffffff f3701e1c f3701df8 c04e033a f3701e1c f3701dec c0139dec
Call Trace:
[<c04e033a>] schedule_timeout+0x16/0x8b
[<c0139dec>] ? trace_hardirqs_on+0xe9/0x111
[<c04e01c9>] wait_for_common+0xcf/0x12e
[<c011a3f0>] ? default_wake_function+0x0/0xd
[<c04e02aa>] wait_for_completion+0x12/0x14
[<c012ccbb>] flush_cpu_workqueue+0x50/0x66
[<c012cd28>] ? wq_barrier_func+0x0/0xd
[<c012cd14>] cleanup_workqueue_thread+0x43/0x57
[<c04c6f87>] workqueue_cpu_callback+0x8e/0xbd
[<c04e3975>] notifier_call_chain+0x2b/0x4a
[<c0132e9d>] __raw_notifier_call_chain+0xe/0x10
[<c0132eab>] raw_notifier_call_chain+0xc/0xe
[<c013e054>] _cpu_down+0x150/0x1ec
[<c013e133>] cpu_down+0x23/0x30
[<c02e3897>] store_online+0x27/0x5a
[<c02e3870>] ? store_online+0x0/0x5a
[<c02e09d8>] sysdev_store+0x20/0x25
[<c0196d2d>] sysfs_write_file+0xad/0xdf
[<c0196c80>] ? sysfs_write_file+0x0/0xdf
[<c0163da9>] vfs_write+0x8c/0x108
[<c0164333>] sys_write+0x3b/0x60
[<c01049da>] sysenter_past_esp+0x5f/0xa5
=======================
3 locks held by bash/4467:
#0: (&buffer->mutex){--..}, at: [<c0196ca5>] sysfs_write_file+0x25/0xdf
#1: (cpu_add_remove_lock){--..}, at: [<c013e10e>] cpu_maps_update_begin+0xf/0x11
#2: (cpu_hotplug_lock){----}, at: [<c013df5b>] _cpu_down+0x57/0x1ec

So it's not just a not reaping of watchdog thread issue.

I doubt it's due to some locking dependency since we have lockdep checks
in the workqueue code before we flush the cpu_workqueue.

Yi Yang

unread,

Mar 4, 2008, 4:50:11 AM3/4/08

to

You may "echo 1 > /proc/sys/kernel/sysrq" and "echo t
> /proc/sysrq-trigger", then check dmesg info, you can get
[watchdog/#]'s call stack which could give out where it is currently.

On my machine, that indicated [watchdog/1] is calling
sched_setscheduler. I doubt it is being killed before it is started and
woken up, this may result in some synchronization issues.

Oleg Nesterov

unread,

Mar 4, 2008, 10:00:22 AM3/4/08

to

On 03/04, Gautham R Shenoy wrote:
>
> So at times, the callback thread is blocked on kthread_stop(k) in
> softlockup.c, while other time, it was blocked in
> cleanup_workqueue_threads() in workqueue.c.

From another message:

>
> However, it remains in R< state

What about cwq->thread? Was it TASK_RUNNING too?

Perhaps, for some reason the task can't get CPU after migrating from
the now dead CPU.

I can't reproduce this problem on my one cpu P4-ht, perhaps you can
try something like the untested/uncompiled patch below?

Oleg.

--- include/linux/sched.h 2008-02-15 16:59:17.000000000 +0300
+++ include/linux/sched.h 2008-03-04 17:44:53.136738605 +0300
@@ -1121,6 +1121,7 @@ struct task_struct {
/* hung task detection */
unsigned long last_switch_timestamp;
unsigned long last_switch_count;
+ unsigned long xxx;
#endif
/* CPU-specific state of this task */
struct thread_struct thread;
--- kernel/fork.c 2008-02-15 16:59:17.000000000 +0300
+++ kernel/fork.c 2008-03-04 17:45:14.773033839 +0300
@@ -1097,6 +1097,7 @@ static struct task_struct *copy_process(
#ifdef CONFIG_DETECT_SOFTLOCKUP
p->last_switch_count = 0;
p->last_switch_timestamp = 0;
+ p->xxx = 0;
#endif

#ifdef CONFIG_TASK_XACCT
--- kernel/sched.c 2008-02-15 16:59:17.000000000 +0300
+++ kernel/sched.c 2008-03-04 17:48:42.308798646 +0300
@@ -1291,6 +1291,7 @@ static void enqueue_task(struct rq *rq,
sched_info_queued(p);
p->sched_class->enqueue_task(rq, p, wakeup);
p->se.on_rq = 1;
+ p->xxx = jiffies | 1;
}

static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
@@ -3944,6 +3945,8 @@ need_resched_nonpreemptible:
preempt_enable_no_resched();
if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
goto need_resched;
+
+ current->xxx = 0;
}
EXPORT_SYMBOL(schedule);

--- kernel/softlockup.c 2008-02-15 16:59:17.000000000 +0300
+++ kernel/softlockup.c 2008-03-04 17:49:05.584414763 +0300
@@ -174,6 +174,27 @@ static void check_hung_task(struct task_
touch_nmi_watchdog();
}

+static void check_running_task(struct task_struct *t, unsigned long now)
+{
+ if (!sysctl_hung_task_timeout_secs)
+ return;
+
+ if (time_before(now, t->xxx + HZ * sysctl_hung_task_timeout_secs)
+ return;
+
+ printk(KERN_ERR "INFO: task %s:%d can't get CPU for more than "
+ "%ld seconds.\n", t->comm, t->pid,
+ sysctl_hung_task_timeout_secs);
+
+ if (!cpus_intersects(t->cpus_allowed, cpu_online_map))
+ printk(KERN_ERR "bad ->cpus_allowed\n");
+ if (!cpu_online(task_cpu(t)))
+ printk(KERN_ERR "bad ->cpu\n");
+
+ sched_show_task(t);
+ touch_nmi_watchdog();
+}
+
/*
* Check whether a TASK_UNINTERRUPTIBLE does not get woken up for
* a really long time (120 seconds). If that happens, print out
@@ -183,6 +204,7 @@ static void check_hung_uninterruptible_t
{
int max_count = sysctl_hung_task_check_count;
unsigned long now = get_timestamp(this_cpu);
+ unsigned long jiff = jiffies;
struct task_struct *g, *t;

/*
@@ -192,15 +214,17 @@ static void check_hung_uninterruptible_t
if ((tainted & TAINT_DIE) || did_panic)
return;

- read_lock(&tasklist_lock);
+ rcu_read_lock();
do_each_thread(g, t) {
if (!--max_count)
goto unlock;
if (t->state & TASK_UNINTERRUPTIBLE)
check_hung_task(t, now);
+ if (!t->xxx)
+ check_running_task(t, jiff);
} while_each_thread(g, t);
unlock:
- read_unlock(&tasklist_lock);
+ rcu_read_unlock();
}

/*

Yi Yang

unread,

Mar 4, 2008, 9:30:12 PM3/4/08

to

On Tue, 2008-03-04 at 18:01 +0300, Oleg Nesterov wrote:
> On 03/04, Gautham R Shenoy wrote:
> >
> > So at times, the callback thread is blocked on kthread_stop(k) in
> > softlockup.c, while other time, it was blocked in
> > cleanup_workqueue_threads() in workqueue.c.
>
> From another message:
> >
> > However, it remains in R< state
>
> What about cwq->thread? Was it TASK_RUNNING too?
>
> Perhaps, for some reason the task can't get CPU after migrating from
> the now dead CPU.
>
> I can't reproduce this problem on my one cpu P4-ht, perhaps you can
> try something like the untested/uncompiled patch below?

You need one multiprocessor system to regenerate, 4 CPUs or more are
best.

At the beginning, i thought Detect Soft Lockups
(CONFIG_DETECT_SOFTLOCKUP) is culprit because it added much code to
detect lock dependencies and deadlocks, but the issue is still there
after i disabled soft lock detection, it can be regenerated but will
take much long time.

At this time, group_balance and kblockd are in "R<" state.

[root@harwich-rh ~]# ps aux | grep kstop
root 15466 0.0 0.0 0 0 ? S< 01:03 0:00 [kstopmachine]
root 15467 0.0 0.0 0 0 ? Z< 01:03 0:00 [kstopmachine] <defunct>
root 15468 0.0 0.0 0 0 ? Z< 01:03 0:00 [kstopmachine] <defunct>
root 15469 0.0 0.0 0 0 ? Z< 01:03 0:00 [kstopmachine] <defunct>
root 15470 0.0 0.0 0 0 ? Z< 01:03 0:00 [kstopmachine] <defunct>
root 15471 0.0 0.0 0 0 ? Z< 01:03 0:00 [kstopmachine] <defunct>
root 15472 0.0 0.0 0 0 ? Z< 01:03 0:00 [kstopmachine] <defunct>
root 15473 0.0 0.0 0 0 ? Z< 01:03 0:00 [kstopmachine] <defunct>
root 15474 0.0 0.0 0 0 ? Z< 01:03 0:00 [kstopmachine] <defunct>
root 15475 0.0 0.0 0 0 ? Z< 01:03 0:00 [kstopmachine] <defunct>
root 15476 0.0 0.0 0 0 ? Z< 01:03 0:00 [kstopmachine] <defunct>
root 17155 0.0 0.0 61148 708 pts/3 R+ 14:47 0:00 grep kstop
[root@harwich-rh ~]# ps aux | grep "R<"
root 35 0.0 0.0 0 0 ? R< Mar13 0:00 [group_balance]
root 182 0.1 0.0 0 0 ? R< Mar13 1:09 [kblockd/0]
root 17157 0.0 0.0 61148 704 pts/3 R+ 14:47 0:00 grep R<
[root@harwich-rh ~]#

Gautham R Shenoy

unread,

Mar 5, 2008, 5:10:09 AM3/5/08

to

On Tue, Mar 04, 2008 at 06:01:07PM +0300, Oleg Nesterov wrote:
> On 03/04, Gautham R Shenoy wrote:
> >
> > So at times, the callback thread is blocked on kthread_stop(k) in
> > softlockup.c, while other time, it was blocked in
> > cleanup_workqueue_threads() in workqueue.c.
>
> >From another message:
> >
> > However, it remains in R< state
>
> What about cwq->thread? Was it TASK_RUNNING too?

No, it was in TASK_UNINTERRUPTIBLE state. The last thing it ever
executed was the wait_for_completion in flush_cpu_workqueue()

--
Thanks and Regards
gautham

Oleg Nesterov

unread,

Mar 5, 2008, 9:00:19 AM3/5/08

to

On 03/05, Gautham R Shenoy wrote:
>
> On Tue, Mar 04, 2008 at 06:01:07PM +0300, Oleg Nesterov wrote:
> > On 03/04, Gautham R Shenoy wrote:
> > >
> > > So at times, the callback thread is blocked on kthread_stop(k) in
> > > softlockup.c, while other time, it was blocked in
> > > cleanup_workqueue_threads() in workqueue.c.
> >
> > >From another message:
> > >
> > > However, it remains in R< state
> >
> > What about cwq->thread? Was it TASK_RUNNING too?
>
> No, it was in TASK_UNINTERRUPTIBLE state. The last thing it ever
> executed was the wait_for_completion in flush_cpu_workqueue()

You misunderstood ;) The task doing _cpu_down() hang in flush_cpu_workqueue(),
because cwq->thread doesn't want to die. But what was cwq->thread->state?

Could you try the patch? Yi constantly sees the tasks in R< state.
I am starting to suspect some scheduling issue.

Oleg.

Gautham R Shenoy

unread,

Mar 6, 2008, 6:20:10 AM3/6/08

to

On Wed, Mar 05, 2008 at 04:53:14PM +0300, Oleg Nesterov wrote:
> On 03/05, Gautham R Shenoy wrote:
> >
> > On Tue, Mar 04, 2008 at 06:01:07PM +0300, Oleg Nesterov wrote:
> > > On 03/04, Gautham R Shenoy wrote:
> > > >
> > > > So at times, the callback thread is blocked on kthread_stop(k) in
> > > > softlockup.c, while other time, it was blocked in
> > > > cleanup_workqueue_threads() in workqueue.c.
> > >
> > > >From another message:
> > > >
> > > > However, it remains in R< state
> > >
> > > What about cwq->thread? Was it TASK_RUNNING too?
> >
> > No, it was in TASK_UNINTERRUPTIBLE state. The last thing it ever
> > executed was the wait_for_completion in flush_cpu_workqueue()
>
> You misunderstood ;) The task doing _cpu_down() hang in flush_cpu_workqueue(),
> because cwq->thread doesn't want to die. But what was cwq->thread->state?
>

Yup, sorry for that stupid reply.

> Could you try the patch? Yi constantly sees the tasks in R< state.
> I am starting to suspect some scheduling issue.

Me too. With your patch applied there were quite a few tasks in the
running state which didn't get the cpu for more than 120 seconds.

I could be a scheduler problem which CPU-Hotplug is able to catch more
easily. Will try a git bisect.

>
> Oleg.

--
Thanks and Regards
gautham

Gautham R Shenoy

unread,

Mar 6, 2008, 7:30:15 AM3/6/08

to

On Thu, Mar 06, 2008 at 04:45:25PM +0530, Gautham R Shenoy wrote:
> On Wed, Mar 05, 2008 at 04:53:14PM +0300, Oleg Nesterov wrote:
> > On 03/05, Gautham R Shenoy wrote:
> > >
> > > On Tue, Mar 04, 2008 at 06:01:07PM +0300, Oleg Nesterov wrote:
> > > > On 03/04, Gautham R Shenoy wrote:
> > > > >
> > > > > So at times, the callback thread is blocked on kthread_stop(k) in
> > > > > softlockup.c, while other time, it was blocked in
> > > > > cleanup_workqueue_threads() in workqueue.c.
> > > >
> > > > >From another message:
> > > > >
> > > > > However, it remains in R< state
> > > >
> > > > What about cwq->thread? Was it TASK_RUNNING too?
> > >
> > > No, it was in TASK_UNINTERRUPTIBLE state. The last thing it ever
> > > executed was the wait_for_completion in flush_cpu_workqueue()
> >
> > You misunderstood ;) The task doing _cpu_down() hang in flush_cpu_workqueue(),
> > because cwq->thread doesn't want to die. But what was cwq->thread->state?
> >
>
> Yup, sorry for that stupid reply.
>
> > Could you try the patch? Yi constantly sees the tasks in R< state.
> > I am starting to suspect some scheduling issue.
>
> Me too. With your patch applied there were quite a few tasks in the
> running state which didn't get the cpu for more than 120 seconds.
>
> I could be a scheduler problem which CPU-Hotplug is able to catch more
> easily. Will try a git bisect.

BTW, with your patch, I get the "task can't get CPU for more than
120 seconds" message even while running a kernbench on 2.6.25-rc2.

Gautham R Shenoy

unread,

Mar 6, 2008, 8:50:14 AM3/6/08

to

On Tue, Mar 04, 2008 at 06:01:07PM +0300, Oleg Nesterov wrote:

This function gets called only when t->xxx == 0,
so the if below doesn't mean much, does it? :)

--
Thanks and Regards
gautham

Oleg Nesterov

unread,

Mar 6, 2008, 10:00:10 PM3/6/08

to

On 03/06, Gautham R Shenoy wrote:
>
> On Tue, Mar 04, 2008 at 06:01:07PM +0300, Oleg Nesterov wrote:
> > +static void check_running_task(struct task_struct *t, unsigned long now)
> > +{
> > + if (!sysctl_hung_task_timeout_secs)
> > + return;
> > +
>
> This function gets called only when t->xxx == 0,
> so the if below doesn't mean much, does it? :)
>
> > + if (time_before(now, t->xxx + HZ * sysctl_hung_task_timeout_secs)
> > + return;

> .......

>
> > @@ -192,15 +214,17 @@ static void check_hung_uninterruptible_t
> > if ((tainted & TAINT_DIE) || did_panic)
> > return;
> >
> > - read_lock(&tasklist_lock);
> > + rcu_read_lock();
> > do_each_thread(g, t) {
> > if (!--max_count)
> > goto unlock;
> > if (t->state & TASK_UNINTERRUPTIBLE)
> > check_hung_task(t, now);
> > + if (!t->xxx)
> > + check_running_task(t, jiff);

Of course, the check above should be

if (1t->xxx)
check_running_task(t, jiff);

Thanks!

From another message,

>
> Me too. With your patch applied there were quite a few tasks in the

> running state which didn't get the cpu for more than 120 seconds.

(I assume you fixed the patch before using it ;)

Just to be sure, there were no "bad ->cpu..." messages, yes?

Oleg.

Yi Yang

unread,

Mar 7, 2008, 2:50:10 AM3/7/08

to

I found priority of some processes is always 0 but they are highly
related to this kind of deadlock.

What does Priority 0 mean? When and in which case is it's priority
changed to 0? When can it's priority be restored to one higher value.

A simple script to get priority 0:

===================================================================
#!/bin/sh

for file in `ls /proc/*/sched`
do
prio=`grep prio $file | awk '{print $3;}'`
if [ "X$prio" == "X0" ] ; then
echo "`head -n 1 $file | awk '{print $1}'`: $prio"
fi
done
==================================================================

[root@harwich-rh ~]# ./lsprio
migration/3: 0
migration/4: 0
migration/5: 0
migration/6: 0
migration/7: 0
migration/8: 0
migration/9: 0
migration/10: 0
migration/11: 0
migration/12: 0
migration/0: 0
migration/13: 0
migration/14: 0
migration/15: 0
grep: /proc/5244/sched: No such file or directory
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
watchdog/1: 0
migration/2: 0

[root@harwich-rh ~]# ps aux | grep "R<"

root 8 0.0 0.0 0 0 ? R< 19:29 0:00
[watchdog/1]
root 6987 0.0 0.0 61144 708 pts/2 R+ 20:16 0:00 grep R<
[root@harwich-rh ~]#

--

Gautham R Shenoy

unread,

Mar 7, 2008, 4:30:14 AM3/7/08

to

No! Conversely, I fixed the patch because I found this behaviour a bit
odd. Couldn't run the tests again as it was a tad bit late.

>
> Just to be sure, there were no "bad ->cpu..." messages, yes?

Hopefully should be able to catch them now. If yes, it's a problem in
the way we do migration after cpu-hotplug as Yi suggested in an earlier
mail.

http://lkml.org/lkml/2008/3/6/437

This mail from akpm says the same thing.

>
> Oleg.

--
Thanks and Regards
gautham

Gautham R Shenoy

unread,

Mar 7, 2008, 6:00:19 AM3/7/08

to

On Fri, Mar 07, 2008 at 02:40:49PM +0530, Gautham R Shenoy wrote:
>
> >
> > Just to be sure, there were no "bad ->cpu..." messages, yes?
>
> Hopefully should be able to catch them now. If yes, it's a problem in
> the way we do migration after cpu-hotplug as Yi suggested in an earlier
> mail.
>
> http://lkml.org/lkml/2008/3/6/437
>
> This mail from akpm says the same thing.

Yup! There are a quite a few "bad->cpu" messages.
All of them only for watchdog/1.

What surprises me is the fact that the first of task-hung messages come
after 136 successful cpu-hotplug attempts.

To answer Andrew's question, migration_call() *should* ideally be the
first notifier called, because it's registered with the priority 10.

Unless there's some change that has recently gone in the
notifier_call_chain subsystem that doesn't honour the registration priority
anymore, migration_call() will be called first,
which means the task affinity for watchdog thread is broken and hence
we shouldn't see the message below.

I'll probe to see if there's a change in the call order that's causing
this hang.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
INFO: task watchdog/1:10958 can't get CPU for more than 120 seconds.
bad ->cpu
f524ffac 00000046 c011befc f5276aa0 f5276bd8 c443db80 00000001 f6130b80
f524ff94 00000246 c07130f4 03d28000 00000000 f524ff9c c01438ad f524ffac
00000001 c0143a1f 00000000 f524ffd0 c0143a66 00000001 c0119fea 00000000
Call Trace:
[<c011befc>] ? cpu_clock+0x4e/0x59
[<c01438ad>] ? get_timestamp+0x8/0x11
[<c0143a1f>] ? watchdog+0x0/0x239
[<c0143a66>] watchdog+0x47/0x239
[<c0119fea>] ? complete+0x34/0x3e
[<c0143a1f>] ? watchdog+0x0/0x239
[<c012f797>] kthread+0x3b/0x64
[<c012f75c>] ? kthread+0x0/0x64
[<c01056eb>] kernel_thread_helper+0x7/0x10
=======================
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Yi Yang

unread,

Mar 7, 2008, 6:10:13 AM3/7/08

to

On Fri, 2008-03-07 at 16:21 +0530, Gautham R Shenoy wrote:
> On Fri, Mar 07, 2008 at 02:40:49PM +0530, Gautham R Shenoy wrote:
> >
> > >
> > > Just to be sure, there were no "bad ->cpu..." messages, yes?
> >
> > Hopefully should be able to catch them now. If yes, it's a problem in
> > the way we do migration after cpu-hotplug as Yi suggested in an earlier
> > mail.
> >
> > http://lkml.org/lkml/2008/3/6/437
> >
> > This mail from akpm says the same thing.
>
> Yup! There are a quite a few "bad->cpu" messages.
> All of them only for watchdog/1.
>
> What surprises me is the fact that the first of task-hung messages come
> after 136 successful cpu-hotplug attempts.
>
> To answer Andrew's question, migration_call() *should* ideally be the
> first notifier called, because it's registered with the priority 10.
>
> Unless there's some change that has recently gone in the
> notifier_call_chain subsystem that doesn't honour the registration priority
> anymore, migration_call() will be called first,
> which means the task affinity for watchdog thread is broken and hence
> we shouldn't see the message below.
>
> I'll probe to see if there's a change in the call order that's causing
> this hang.

I have traced all the cpu callback functions, i ensure migration_call is
called firstly, i have ever set notifier priority of cpu_callback in
kernel/softlockup.c to maximum int and check what will happen, the
result is same.

>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> INFO: task watchdog/1:10958 can't get CPU for more than 120 seconds.
> bad ->cpu
> f524ffac 00000046 c011befc f5276aa0 f5276bd8 c443db80 00000001 f6130b80
> f524ff94 00000246 c07130f4 03d28000 00000000 f524ff9c c01438ad f524ffac
> 00000001 c0143a1f 00000000 f524ffd0 c0143a66 00000001 c0119fea 00000000
> Call Trace:
> [<c011befc>] ? cpu_clock+0x4e/0x59
> [<c01438ad>] ? get_timestamp+0x8/0x11
> [<c0143a1f>] ? watchdog+0x0/0x239
> [<c0143a66>] watchdog+0x47/0x239
> [<c0119fea>] ? complete+0x34/0x3e
> [<c0143a1f>] ? watchdog+0x0/0x239
> [<c012f797>] kthread+0x3b/0x64
> [<c012f75c>] ? kthread+0x0/0x64
> [<c01056eb>] kernel_thread_helper+0x7/0x10
> =======================
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
>
>
> >
> > >
> > > Oleg.
> >
> > --
> > Thanks and Regards
> > gautham
>

--

Dmitry Adamushko

unread,

Mar 7, 2008, 8:10:17 AM3/7/08

to

Hi,

'watchdog' is of SCHED_FIFO class. The standard load-balancer doesn't
move RT tasks between cpus anymore and there is a special mechanism in
scher_rt.c instead (I think, it's .25 material).

So I wonder, whether __migrate_task() is still capable of properly
moving a RT task to another CPU (e.g. for the case when it's in
TASK_RUNNING state) without breaking something in the rt migration
mechanism (or whatever else) that would leave us with a runqueue in
the 'inconsistent' state...
(I've taken a quick look at the relevant code so can't confirm it yet)

maybe it'd be faster if somebody could do a quick test now with the
following line commented out in kernel/softlockup.c :: watchdog()

- sched_setscheduler(current, SCHED_FIFO, &param);

--
Best regards,
Dmitry Adamushko

Gautham R Shenoy

unread,

Mar 7, 2008, 9:00:23 AM3/7/08

to

On Fri, Mar 07, 2008 at 02:02:20PM +0100, Dmitry Adamushko wrote:
> Hi,
>
> 'watchdog' is of SCHED_FIFO class. The standard load-balancer doesn't
> move RT tasks between cpus anymore and there is a special mechanism in
> scher_rt.c instead (I think, it's .25 material).
>
> So I wonder, whether __migrate_task() is still capable of properly
> moving a RT task to another CPU (e.g. for the case when it's in
> TASK_RUNNING state) without breaking something in the rt migration
> mechanism (or whatever else) that would leave us with a runqueue in
> the 'inconsistent' state...
> (I've taken a quick look at the relevant code so can't confirm it yet)
>
> maybe it'd be faster if somebody could do a quick test now with the
> following line commented out in kernel/softlockup.c :: watchdog()
>
> - sched_setscheduler(current, SCHED_FIFO, &param);

Commenting out that like seems to work. Passed 500 iterations of
cpu-hotplug without any problems.

>
>
> --
> Best regards,
> Dmitry Adamushko

--
Thanks and Regards
gautham

Suresh Siddha

unread,

Mar 7, 2008, 2:20:09 PM3/7/08

to

On Fri, Mar 07, 2008 at 07:50:02AM -0800, Gautham R Shenoy wrote:

> On Fri, Mar 07, 2008 at 07:25:37PM +0530, Gautham R Shenoy wrote:
> > On Fri, Mar 07, 2008 at 02:02:20PM +0100, Dmitry Adamushko wrote:
> > > Hi,
> > >
> > > 'watchdog' is of SCHED_FIFO class. The standard load-balancer doesn't
> > > move RT tasks between cpus anymore and there is a special mechanism in
> > > scher_rt.c instead (I think, it's .25 material).
> > >
> > > So I wonder, whether __migrate_task() is still capable of properly
> > > moving a RT task to another CPU (e.g. for the case when it's in
> > > TASK_RUNNING state) without breaking something in the rt migration
> > > mechanism (or whatever else) that would leave us with a runqueue in
> > > the 'inconsistent' state...
> > > (I've taken a quick look at the relevant code so can't confirm it yet)
> > >
> > > maybe it'd be faster if somebody could do a quick test now with the
> > > following line commented out in kernel/softlockup.c :: watchdog()
> > >
> > > - sched_setscheduler(current, SCHED_FIFO, &param);
> >
> > Commenting out that like seems to work. Passed 500 iterations of
> > cpu-hotplug without any problems.
>

> This seems to unearth another problem. After some 850 successful
> cpu-hotplug iterations I got this message.
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> ------------[ cut here ]------------
> BUG: spinlock recursion on CPU#2, kstopmachine/32521
> BUG: spinlock lockup on CPU#1, kstopmachine/32517, cc43db80
> Pid: 32517, comm: kstopmachine Not tainted 2.6.25-rc3 #44
> [<c0284ebb>] <0>BUG: spinlock lockup on CPU#0, kstopmachine/32520,
> cc43db80
> _raw_spin_lock+0xd5/0xf9
> [<c04e1a33>] <0>BUG: spinlock lockup on CPU#3, kstopmachine/32522,
> cc43db80
> _spin_lock+0x20/0x28
> Pid: 32522, comm: kstopmachine Not tainted 2.6.25-rc3 #44
> [<c04dfa41>] ? [<c0284ebb>] schedule+0xb0/0x5ab
> [<c04dfa41>] schedule+0xb0/0x5ab
> _raw_spin_lock+0xd5/0xf9
> [<c04e1d58>] ? [<c04e1a33>] _spin_unlock_irqrestore+0x36/0x3c
> [<c0119fdc>] ? _spin_lock+0x20/0x28
> [<c0119806>] ? complete+0x34/0x3e
> [<c0143646>] double_lock_balance+0x3a/0x57
> [<c0119806>] do_stop+0xd4/0xfe

Well, there is another

sched_setscheduler(p, SCHED_FIFO, &param);

in kernel/stop_machine.c

Perhaps we need to remove this aswell and try?

thanks,
suresh

Andrew Morton

unread,

Mar 7, 2008, 3:30:16 PM3/7/08

to

On Fri, 7 Mar 2008 14:02:20 +0100
"Dmitry Adamushko" <dmitry.a...@gmail.com> wrote:

> Hi,
>
> 'watchdog' is of SCHED_FIFO class. The standard load-balancer doesn't
> move RT tasks between cpus anymore and there is a special mechanism in
> scher_rt.c instead (I think, it's .25 material).
>
> So I wonder, whether __migrate_task() is still capable of properly
> moving a RT task to another CPU (e.g. for the case when it's in
> TASK_RUNNING state) without breaking something in the rt migration
> mechanism (or whatever else) that would leave us with a runqueue in
> the 'inconsistent' state...
> (I've taken a quick look at the relevant code so can't confirm it yet)
>
> maybe it'd be faster if somebody could do a quick test now with the
> following line commented out in kernel/softlockup.c :: watchdog()
>
> - sched_setscheduler(current, SCHED_FIFO, &param);
>

Yup, thanks. This:

kernel/softirq.c | 2 +-
kernel/softlockup.c | 2 +-
kernel/stop_machine.c | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)

diff -puN kernel/softlockup.c~a kernel/softlockup.c
--- a/kernel/softlockup.c~a
+++ a/kernel/softlockup.c
@@ -211,7 +211,7 @@ static int watchdog(void *__bind_cpu)
struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
int this_cpu = (long)__bind_cpu;

- sched_setscheduler(current, SCHED_FIFO, &param);
+// sched_setscheduler(current, SCHED_FIFO, &param);

/* initialize timestamp */
touch_softlockup_watchdog();
diff -puN kernel/stop_machine.c~a kernel/stop_machine.c
--- a/kernel/stop_machine.c~a
+++ a/kernel/stop_machine.c
@@ -188,7 +188,7 @@ struct task_struct *__stop_machine_run(i
struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };

/* One high-prio thread per cpu. We'll do this one. */
- sched_setscheduler(p, SCHED_FIFO, &param);
+// sched_setscheduler(p, SCHED_FIFO, &param);
kthread_bind(p, cpu);
wake_up_process(p);
wait_for_completion(&smdata.done);
diff -puN kernel/softirq.c~a kernel/softirq.c
--- a/kernel/softirq.c~a
+++ a/kernel/softirq.c
@@ -622,7 +622,7 @@ static int __cpuinit cpu_callback(struct

p = per_cpu(ksoftirqd, hotcpu);
per_cpu(ksoftirqd, hotcpu) = NULL;
- sched_setscheduler(p, SCHED_FIFO, &param);
+// sched_setscheduler(p, SCHED_FIFO, &param);
kthread_stop(p);
takeover_tasklets(hotcpu);
break;
_

fixes the wont-power-off regression.

But 2.6.24 runs the watchdog threads SCHED_FIFO too. Are you saying that
it's the migration code which changed?

Rafael J. Wysocki

unread,

Mar 7, 2008, 4:40:14 PM3/7/08

to

Well, that would be a problem for suspend/hibernation if there were SCHED_FIFO
non-freezable tasks bound to the nonboot CPUs. I'm not aware of any, but ...

Also, it should be possible to just offline one or more CPUs using the sysfs
interface at any time and what happens if there are any RT tasks bound to these
CPUs at that time? That would be the $subject problem, IMHO.

Anyone who made the change affecting __migrate_task() so it's unable to migrate
RT tasks any more should have fixed the CPU hotplug as well.

Suresh Siddha

unread,

Mar 7, 2008, 6:10:10 PM3/7/08

to

No. Its not the issue with __migrate_task(). Appended patch fixes my issue.
Recent RT wakeup balance code changes exposed a bug in migration_call() code
path.

Andrew, Please check if the appended patch fixes your power-off problem aswell.

thanks,
suresh
---

Handle the `CPU_DOWN_PREPARE_FROZEN' case in migration_call().

Otherwise, without this, we don't clear the cpu going down in the
root domains "online" mask. This was causing the RT tasks to be woken
up on already dead cpus, causing system hangs during standby, shutdown etc.

For example, on my system, this is the failing sequence:

kthread_stop() // coming from the cpu_callback's
wake_up_process()
sched_class->select_task_rq(); //select_task_rq_rt
find_lowest_rq
find_lowest_cpus
cpus_and(*lowest_mask, task_rq(task)->rd->online, task->cpus_allowed);

In my case tasks->cpus_allowed is set to cpu_possible_map and because of the
this bug, rd->online still shows the dead cpu. Resulting in
find_lowest_rq() return an offlined cpu, because of which RT task gets woken
up on a DEAD cpu, causing various hangs.

This issue doesn't happen with normal tasks because, the select_task_rq_fair()
chooses between only two cpu's (the cpu which is waking up the task or last run
cpu (task_cpu()), kernel hotplug code makes sure that both of which always
represent the online CPU's).

Why it didn't show up in 2.6.24? Because the new wakeup code is using a
complex CPU selection logic in select_task_rq_rt() which exposed this
bug in migration_call()

Signed-off-by: Suresh Siddha <suresh....@intel.com>
---

diff --git a/kernel/sched.c b/kernel/sched.c
index 52b9867..60550d8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5882,6 +5882,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
break;

case CPU_DOWN_PREPARE:
+ case CPU_DOWN_PREPARE_FROZEN:
/* Update our root-domain */
rq = cpu_rq(cpu);
spin_lock_irqsave(&rq->lock, flags);

Andrew Morton

unread,

Mar 7, 2008, 6:40:07 PM3/7/08

to

On Fri, 7 Mar 2008 15:01:26 -0800
Suresh Siddha <suresh....@intel.com> wrote:

>
> Andrew, Please check if the appended patch fixes your power-off problem aswell.

> ...

>
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -5882,6 +5882,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
> break;
>
> case CPU_DOWN_PREPARE:
> + case CPU_DOWN_PREPARE_FROZEN:
> /* Update our root-domain */
> rq = cpu_rq(cpu);
> spin_lock_irqsave(&rq->lock, flags);

No, it does not.

Rafael J. Wysocki

unread,

Mar 7, 2008, 6:50:06 PM3/7/08

to

On Saturday, 8 of March 2008, Andrew Morton wrote:
> On Fri, 7 Mar 2008 15:01:26 -0800
> Suresh Siddha <suresh....@intel.com> wrote:
>
> >
> > Andrew, Please check if the appended patch fixes your power-off problem aswell.
> > ...
> >
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -5882,6 +5882,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
> > break;
> >
> > case CPU_DOWN_PREPARE:
> > + case CPU_DOWN_PREPARE_FROZEN:
> > /* Update our root-domain */
> > rq = cpu_rq(cpu);
> > spin_lock_irqsave(&rq->lock, flags);
>
> No, it does not.

Well, this is a bug nevertheless.

In fact, the frozen bit should only be set during hibernation/suspend and not
during shutdown ...

So, it looks like disable_nonboot_cpus() needs an extra parameter.

Suresh Siddha

unread,

Mar 7, 2008, 9:00:09 PM3/7/08

to

On Sat, Mar 08, 2008 at 12:43:15AM +0100, Rafael J. Wysocki wrote:
> On Saturday, 8 of March 2008, Andrew Morton wrote:
> > On Fri, 7 Mar 2008 15:01:26 -0800
> > Suresh Siddha <suresh....@intel.com> wrote:
> >
> > >
> > > Andrew, Please check if the appended patch fixes your power-off problem aswell.
> > > ...
> > >
> > > --- a/kernel/sched.c
> > > +++ b/kernel/sched.c
> > > @@ -5882,6 +5882,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
> > > break;
> > >
> > > case CPU_DOWN_PREPARE:
> > > + case CPU_DOWN_PREPARE_FROZEN:
> > > /* Update our root-domain */
> > > rq = cpu_rq(cpu);
> > > spin_lock_irqsave(&rq->lock, flags);
> >
> > No, it does not.
>
> Well, this is a bug nevertheless.
>

Well, my previous root cause needs some small changes.

During the notifier call chain for CPU_DOWN(till 'update_sched_domains'
is called atleast), all the cpu's are attached to 'def_root_domain', for whom
online mask still has the offline cpu.

This is because, during CPU_DOWN_PREPARE, migration_call() first clears
the root_domain->online, and later during the DOWN_PREPARE call chain
detach_destroy_domains() attach to def_root_domain with cpu_online_map(which
still has the just about to die 'cpu' set).

So essentially, during the notifier call chain of CPU_DOWN (before
'update_sched_domains' is called atleast), any one doing RT process
wakeup's (for example: kthread_stop()) can still end up on the dead cpu.

Andrew, Can you please try one more patch(appended) to see if it helps?

Signed-off-by: Suresh Siddha <suresh....@intel.com>
---

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 0a6d2e5..8cb707c 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -597,7 +597,7 @@ static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
int count = 0;
int cpu;

- cpus_and(*lowest_mask, task_rq(task)->rd->online, task->cpus_allowed);
+ cpus_and(*lowest_mask, task->cpus_allowed, cpu_online_map);

/*
* Scan each rq for the lowest prio.

Andrew Morton

unread,

Mar 7, 2008, 9:20:07 PM3/7/08

to

On Fri, 7 Mar 2008 17:50:45 -0800 Suresh Siddha <suresh....@intel.com> wrote:

> Andrew, Can you please try one more patch(appended) to see if it helps?
>

Not until Monday, sorry - the offending machine is at work and (due to
this bug) cannot be woken via LAN.

> ---
>
> diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
> index 0a6d2e5..8cb707c 100644
> --- a/kernel/sched_rt.c
> +++ b/kernel/sched_rt.c
> @@ -597,7 +597,7 @@ static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
> int count = 0;
> int cpu;
>
> - cpus_and(*lowest_mask, task_rq(task)->rd->online, task->cpus_allowed);
> + cpus_and(*lowest_mask, task->cpus_allowed, cpu_online_map);

Gregory Haskins

unread,

Mar 8, 2008, 12:50:07 AM3/8/08

to

Hi Suresh,
Unfortunately, this patch will introduce its own set of bugs.
However, your analysis was spot-on. I think I see the problem now. It
was introduced when I put a hack in to "fix" s2ram problems in -mm as a
result of the new root-domain logic. I think the following patch will
fix both issues:

(I verified that I could take a cpu offline/online, but I don't have an
s2ram compatible machine handy. Andrew, I believe you could reproduce
the s2ram problem a few months ago when that issue popped up. So if you
could, please verify that s2ram also works with this patch applied, in
addition to the hotplug problem.

Regards,
-Greg

-------------------------------------------------

adjust root-domain->online span in response to hotplug event

We currently set the root-domain online span automatically when the domain
is added to the cpu if the cpu is already a member of cpu_online_map.
This was done as a hack/bug-fix for s2ram, but it also causes a problem
with hotplug CPU_DOWN transitioning. The right way to fix the original
problem is to actually respond to CPU_UP events, instead of CPU_ONLINE,
which is already too late.

Signed-off-by: Gregory Haskins <ghas...@novell.com>
---

kernel/sched.c | 18 +++++++-----------
1 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 52b9867..b02e4fc 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5813,6 +5813,13 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
/* Must be high prio: stop_machine expects to yield to it. */
rq = task_rq_lock(p, &flags);
__setscheduler(rq, p, SCHED_FIFO, MAX_RT_PRIO-1);
+
+ /* Update our root-domain */
+ if (rq->rd) {
+ BUG_ON(!cpu_isset(cpu, rq->rd->span));
+ cpu_set(cpu, rq->rd->online);
+ }
+
task_rq_unlock(rq, &flags);
cpu_rq(cpu)->migration_thread = p;
break;
@@ -5821,15 +5828,6 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
case CPU_ONLINE_FROZEN:
/* Strictly unnecessary, as first user will wake it. */
wake_up_process(cpu_rq(cpu)->migration_thread);
-
- /* Update our root-domain */
- rq = cpu_rq(cpu);
- spin_lock_irqsave(&rq->lock, flags);
- if (rq->rd) {
- BUG_ON(!cpu_isset(cpu, rq->rd->span));
- cpu_set(cpu, rq->rd->online);
- }
- spin_unlock_irqrestore(&rq->lock, flags);
break;

#ifdef CONFIG_HOTPLUG_CPU
@@ -6105,8 +6103,6 @@ static void rq_attach_root(struct rq *rq, struct root_domain *rd)
rq->rd = rd;

cpu_set(rq->cpu, rd->span);
- if (cpu_isset(rq->cpu, cpu_online_map))
- cpu_set(rq->cpu, rd->online);

for (class = sched_class_highest; class; class = class->next) {
if (class->join_domain)

Ingo Molnar

unread,

Mar 8, 2008, 3:50:09 AM3/8/08

to

* Gregory Haskins <ghas...@novell.com> wrote:

> Unfortunately, this patch will introduce its own set of bugs.
> However, your analysis was spot-on. I think I see the problem now.
> It was introduced when I put a hack in to "fix" s2ram problems in -mm
> as a result of the new root-domain logic. I think the following patch
> will fix both issues:
>
> (I verified that I could take a cpu offline/online, but I don't have
> an s2ram compatible machine handy. Andrew, I believe you could
> reproduce the s2ram problem a few months ago when that issue popped
> up. So if you could, please verify that s2ram also works with this
> patch applied, in addition to the hotplug problem.

thanks Gregory, i've queued up your fix. If it passes all tests over the
day i'll send it to Linus in about 12 hours.

Ingo

Gregory Haskins

unread,

Mar 8, 2008, 1:00:16 PM3/8/08

to

>>> On Sat, Mar 8, 2008 at 3:41 AM, in message <20080308084...@elte.hu>,
Ingo Molnar <mi...@elte.hu> wrote:

> * Gregory Haskins <ghas...@novell.com> wrote:
>
>> Unfortunately, this patch will introduce its own set of bugs.
>> However, your analysis was spot-on. I think I see the problem now.
>> It was introduced when I put a hack in to "fix" s2ram problems in -mm
>> as a result of the new root-domain logic. I think the following patch
>> will fix both issues:
>>
>> (I verified that I could take a cpu offline/online, but I don't have
>> an s2ram compatible machine handy. Andrew, I believe you could
>> reproduce the s2ram problem a few months ago when that issue popped
>> up. So if you could, please verify that s2ram also works with this
>> patch applied, in addition to the hotplug problem.
>
> thanks Gregory, i've queued up your fix. If it passes all tests over the
> day i'll send it to Linus in about 12 hours.
>

After thinking about it some more, I am not sure if I got this fix quite right. The first two hunks are technically fine and should ultimately go in. The last hunk is questionable.

Ultimately, I believe the root cause of these reported issues is that cpu_online_map and rd->online can get out of sync. I fear that while the current patch may fix the hotplug/s2ram case, it potentially breaks the disjoint cpuset topology change case. I will run a few cpuset tests ASAP to confirm, though I may not have a chance until Monday.

What I probably need is to tie the code that sets/clears cpu_online_map to the root-domain rd->online map somehow. However, a cursory cscope inspection has failed to reveal the location that this cpu_online_map manipulation occurs. If anyone knows where cpu_online_map is actually updated, please let me know. Otherwise, let me think about this some more and get back to you.

In the meantime, cpu_and'ing "task->cpus_allowed & cpu_online_map & rd->online" in the place that Suresh identified should technically fix the hotplug issue without adding more bugs, at the expense of requiring two calls to cpu_and(). The expense should be more or less negligible on mainstream kernels, but may hurt our big-iron (e.g. SGI) brethren. Hopefully I will come up with a real fix shortly, before it becomes a real issue.

Regards,
-Greg

Suresh Siddha

unread,

Mar 8, 2008, 9:40:07 PM3/8/08

to

On Sat, Mar 08, 2008 at 12:10:15AM -0500, Gregory Haskins wrote:
> Suresh Siddha wrote:
> >
> > - cpus_and(*lowest_mask, task_rq(task)->rd->online, task->cpus_allowed);
> > + cpus_and(*lowest_mask, task->cpus_allowed, cpu_online_map);
>

> Hi Suresh,
> Unfortunately, this patch will introduce its own set of bugs.

Is that because of the missing get/put_online_cpus() ?

> However, your analysis was spot-on. I think I see the problem now. It
> was introduced when I put a hack in to "fix" s2ram problems in -mm as a
> result of the new root-domain logic. I think the following patch will
> fix both issues:

BTW, what is use of per root domains online map, when we have global
cpu_online_map() ? Each domains span & cpu_online_map should give
domain_online_map anyhow.

And is n't it buggy if someone accesses rd->online with out get/put_online_map()

thanks,
suresh

Gautham R Shenoy

unread,

Mar 10, 2008, 4:20:10 AM3/10/08

to

On Sat, Mar 08, 2008 at 12:10:15AM -0500, Gregory Haskins wrote:

Hi Greg,

Suppose someone issues a wakeup at some point between CPU_UP_PREPARE
and CPU_ONLINE, then isn't there a possibility that the task could
be woken up on the cpu which has not yet come online ?
Because at this point in find_lowest_cpus()

cpus_and(*lowest_mask, task_rq(task)->rd->online, task->cpus_allowed);

the cpu which has not yet come online is set in both rd->online map
and the task->cpus_allowed.

I wonder if assigning a priority to the update_sched_domains() notifier
so that it's called immediately after migration_call() would solve the
problem.

--
Thanks and Regards
gautham

Gregory Haskins

unread,

Mar 10, 2008, 8:50:10 AM3/10/08

to

>>> On Sat, Mar 8, 2008 at 9:35 PM, in message
<20080309023...@linux-os.sc.intel.com>, Suresh Siddha

<suresh....@intel.com> wrote:
> On Sat, Mar 08, 2008 at 12:10:15AM -0500, Gregory Haskins wrote:
>> Suresh Siddha wrote:
>> >
>> > - cpus_and(*lowest_mask, task_rq(task)->rd->online, task->cpus_allowed);
>> > + cpus_and(*lowest_mask, task->cpus_allowed, cpu_online_map);
>>
>> Hi Suresh,
>> Unfortunately, this patch will introduce its own set of bugs.
>
> Is that because of the missing get/put_online_cpus() ?

Actually, I was referring to the problem of potentially including "out of domain" CPUs in the search, but that is a good point too. Ill have to think about whether a get/put is needed here.

>
>> However, your analysis was spot-on. I think I see the problem now. It
>> was introduced when I put a hack in to "fix" s2ram problems in -mm as a
>> result of the new root-domain logic. I think the following patch will
>> fix both issues:
>
> BTW, what is use of per root domains online map, when we have global
> cpu_online_map() ? Each domains span & cpu_online_map should give
> domain_online_map anyhow.

Thats exactly right. rd->online is a cached version of "rd->span & cpu_online_map". It simply saves us from having to do an extra cpu_and() at run-time (which can be expensive as the NR_CPUS grows large).

>
> And is n't it buggy if someone accesses rd->online with out
> get/put_online_map()

See above: I'm not sure, but you may be right about that.

Regards,
-Greg

Gautham R Shenoy

unread,

Mar 10, 2008, 9:20:15 AM3/10/08

to

cpu-hotplug: Register update_sched_domains() notifier with higher prio
From: Gautham R Shenoy <e...@in.ibm.com>

The current -rt wake_up logic uses the rq->rd->online_map which is
updated on cpu-hotplug events by update_sched_domains(). Currently
update_sched_domains() is registered with a priority 0. It is
preferable that update_sched_domains() be the first notifier called on
a CPU_DEAD or CPU_ONLINE events inorder to ensure that the
rq->rd->online_map is in sync with the cpu_online_map.

This way on a CPU_DOWN_PREPARE,
we would destroy the sched_domains and reattach all the rq's to the
def_root_domain. This will be followed by a CPU_DOWN_PREPARE in
migration_call() which will clear the bit of the cpu going offline
from the corresponding rq.
This will ensure that no task will be woken up on the cpu which is
going to be offlined between CPU_DOWN_PREPARE and CPU_DEAD.

Ditto for the CPU_ONLINE path.

Tested on a 4-way Intel Xeon Box with cpu-hotplug tests running in parallel
with kernbench.

Signed-off-by: Gautham R Shenoy <e...@in.ibm.com>
---

kernel/sched.c | 12 ++++++++++--
1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 69ecb1a..374c46f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -7079,8 +7079,16 @@ void __init sched_init_smp(void)
if (cpus_empty(non_isolated_cpus))
cpu_set(smp_processor_id(), non_isolated_cpus);
put_online_cpus();
- /* XXX: Theoretical race here - CPU may be hotplugged now */
- hotcpu_notifier(update_sched_domains, 0);
+ /*
+ * XXX: Theoretical race here - CPU may be hotplugged now
+ *
+ * We register the notifier with priority 11, which means that
+ * update_sched_domains() will be called just before migration_call().
+ *
+ * This is necessary to ensure that the rt wake up logic works fine
+ * and the rq->rd->online_map remains in sync with the cpu_online_map.
+ */
+ hotcpu_notifier(update_sched_domains, 11);

/* Move init over to a non-isolated CPU */
if (set_cpus_allowed(current, non_isolated_cpus) < 0)

Gregory Haskins

unread,

Mar 10, 2008, 10:20:16 AM3/10/08

to

>>> On Mon, Mar 10, 2008 at 4:14 AM, in message

<20080310081...@in.ibm.com>, Gautham R Shenoy <e...@in.ibm.com> wrote:

> On Sat, Mar 08, 2008 at 12:10:15AM -0500, Gregory Haskins wrote:

[ snip ]

>> @@ -5813,6 +5813,13 @@ migration_call(struct notifier_block *nfb, unsigned
> long action, void *hcpu)
>> /* Must be high prio: stop_machine expects to yield to it. */
>> rq = task_rq_lock(p, &flags);
>> __setscheduler(rq, p, SCHED_FIFO, MAX_RT_PRIO-1);
>> +
>> + /* Update our root-domain */
>> + if (rq->rd) {
>> + BUG_ON(!cpu_isset(cpu, rq->rd->span));
>> + cpu_set(cpu, rq->rd->online);
>> + }
>
> Hi Greg,
>
> Suppose someone issues a wakeup at some point between CPU_UP_PREPARE
> and CPU_ONLINE, then isn't there a possibility that the task could
> be woken up on the cpu which has not yet come online ?
> Because at this point in find_lowest_cpus()
>
> cpus_and(*lowest_mask, task_rq(task)->rd->online, task->cpus_allowed);
>
> the cpu which has not yet come online is set in both rd->online map
> and the task->cpus_allowed.

Hi Gautham,
This is a good point. I think I got it backwards (or rather, I had the ordering more correct before the patch): I really want to keep the set() on ONLINE as I had it originally. Its the clear() that is misplaced, as DOWN_PREPARE is too early in the chain. I probably should have used DYING to defer the clear out to the point right before the cpu_online_map is updated. Thanks to Dmitry for highlighting the smpboot path which helped me to find the proper notifier symbol.

>
> I wonder if assigning a priority to the update_sched_domains() notifier
> so that it's called immediately after migration_call() would solve the
> problem.

Possibly (and I just saw your patch). But note that migration_call was previously declared as "should run first" so I am not sure if moving update_sched_domain() above it will not have a ripple effect in some other area. If that is "ok", then I agree your solution solves the ordering problem, albeit in a bit heavy handed manner. Otherwise s/CPU_DOWN_PREPARE/CPU_DYING in migrate call will fix the issue as well, I believe. This will push the update of rd->online to be much more tightly coupled with updates to cpu_online_map.

Ingo/Andrew: Please drop the earlier fix I submitted. I don't think it is correct on several fronts. Please try the one below. As before, I have tested that I can offline/online CPUs, but I dont have s2ram capability here.

Regards,
-Greg

---------------------------------------------
keep rd->online and cpu_online_map in sync

It is possible to allow the root-domain cache of online cpus to
become out of sync with the global cpu_online_map. This is because we
currently trigger removal of cpus too early in the notifier chain.
Other DOWN_PREPARE handlers may in fact run and reconfigure the
root-domain topology, thereby stomping on our own offline handling.

The end result is that rd->online may become out of sync with
cpu_online_map, which results in potential task misrouting.

So change the offline handling to be more tightly coupled with the
global offline process by triggering on CPU_DYING intead of
CPU_DOWN_PREPARE.

Signed-off-by: Gregory Haskins <ghas...@novell.com>
---

kernel/sched.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 52b9867..a616fa1 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5881,7 +5881,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
spin_unlock_irq(&rq->lock);
break;

- case CPU_DOWN_PREPARE:
+ case CPU_DYING:

/* Update our root-domain */
rq = cpu_rq(cpu);
spin_lock_irqsave(&rq->lock, flags);

--

Gregory Haskins

unread,

Mar 10, 2008, 10:20:19 AM3/10/08

to

Dmitry Adamushko wrote:
> On 08/03/2008, Gregory Haskins <ghas...@novell.com> wrote:
>
> I guess, it's in arch-specific code. e.g. for x86-64,
>
> down:
>
> kernel/cpu.c :: _cpu_down -> take_cpu_down() -> __cpu_disable() [
> arch/x86/kernel/smpboot_64.c ] -> cpu_clear(cpu, cpu_online_map)
>
> up:
>
> kernel/cpu.c :: _cpu_up() -> __cpu_up [ arch/x86/kernel/smpboot_64.c ]
> -> do_boot_cpu() -> start_secondary() [ runs on a to-be-online cpu ]
> -> cpu_set(smp_processor_id(), cpu_online_map)

Thanks Dmitry! Your tip helped me to figure out what the real solution
should be.

Gautham R Shenoy

unread,

Mar 10, 2008, 10:30:21 AM3/10/08

to

I think CPU_DYING is a more apt point to clear the bit.
Vatsa had infact suggested that we do it in cpu_disable(), but since we
have an event, we might as well make use of it.

>
> >
> > I wonder if assigning a priority to the update_sched_domains() notifier
> > so that it's called immediately after migration_call() would solve the
> > problem.
>
> Possibly (and I just saw your patch). But note that migration_call was previously declared as "should run first" so I am not sure if moving update_sched_domain() above it will not have a ripple effect in some other area. If that is "ok", then I agree your solution solves the ordering problem, albeit in a bit heavy handed manner. Otherwise s/CPU_DOWN_PREPARE/CPU_DYING in migrate call will fix the issue as well, I believe. This will push the update of rd->online to be much more tightly coupled with updates to cpu_online_map.

The reason why migration_call() was declared as "should run first" was
to ensure that all the migration happens before the other subsystems run
their CPU_DEAD events. This is important, else when the subsystems issue
a kthread_stop(subsystem_thread) in their CPU_DEAD case, the
subsystem_thread might still be affined to the offlined cpu, thus
resulting in a deadlock.

That requirement is taken care of in the patch I had sent since
migration_call() will be called immediately after update_sched_domains().

--
Thanks and Regards
gautham

Suresh Siddha

unread,

Mar 10, 2008, 2:20:08 PM3/10/08

to

On Mon, Mar 10, 2008 at 09:39:34AM -0400, Gregory Haskins wrote:

Don't we need to take care of CPU_DYING_FROZEN aswell?

Gregory Haskins

unread,

Mar 10, 2008, 5:30:23 PM3/10/08

to

It is possible to allow the root-domain cache of online cpus to
become out of sync with the global cpu_online_map. This is because we
currently trigger removal of cpus too early in the notifier chain.
Other DOWN_PREPARE handlers may in fact run and reconfigure the
root-domain topology, thereby stomping on our own offline handling.

The end result is that rd->online may become out of sync with
cpu_online_map, which results in potential task misrouting.

So change the offline handling to be more tightly coupled with the
global offline process by triggering on CPU_DYING intead of
CPU_DOWN_PREPARE.

Signed-off-by: Gregory Haskins <ghas...@novell.com>
Cc: Gautham R Shenoy <e...@in.ibm.com>
Cc: "Siddha, Suresh B" <suresh....@intel.com>
Cc: Ingo Molnar <mi...@elte.hu>
Cc: "Rafael J. Wysocki" <r...@sisk.pl>
Cc: Andrew Morton <ak...@linux-foundation.org>
---

kernel/sched.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 52b9867..a616fa1 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5881,7 +5881,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
spin_unlock_irq(&rq->lock);
break;

- case CPU_DOWN_PREPARE:
+ case CPU_DYING:

Rafael J. Wysocki

unread,

Mar 10, 2008, 6:10:11 PM3/10/08

to

Well, I'd say we do.

Gregory Haskins

unread,

Mar 10, 2008, 6:10:12 PM3/10/08

to

Should I add that to the patch as well then?

Suresh Siddha

unread,

Mar 10, 2008, 6:20:05 PM3/10/08

to

On Mon, Mar 10, 2008 at 04:00:28PM -0600, Gregory Haskins wrote:
> >>> On Mon, Mar 10, 2008 at 6:03 PM, in message <20080310230...@sisk.pl>,
> "Rafael J. Wysocki" <r...@sisk.pl> wrote:
> > On Monday, 10 of March 2008, Suresh Siddha wrote:
> >> >
> >> > - case CPU_DOWN_PREPARE:
> >> > + case CPU_DYING:
> >>
> >> Don't we need to take care of CPU_DYING_FROZEN aswell?
> >
> > Well, I'd say we do.
>
> Should I add that to the patch as well then?

Yes please.

Andrew Morton

unread,

Mar 10, 2008, 6:30:15 PM3/10/08

to

This fixes my poweroff-doesn't-work post-2.6.24 regression.

Gregory Haskins

unread,

Mar 10, 2008, 6:40:17 PM3/10/08

to

>>> On Mon, Mar 10, 2008 at 6:10 PM, in message
<20080310221...@linux-os.sc.intel.com>, Suresh Siddha

<suresh....@intel.com> wrote:
> On Mon, Mar 10, 2008 at 04:00:28PM -0600, Gregory Haskins wrote:
>> >>> On Mon, Mar 10, 2008 at 6:03 PM, in message
> <20080310230...@sisk.pl>,
>> "Rafael J. Wysocki" <r...@sisk.pl> wrote:
>> > On Monday, 10 of March 2008, Suresh Siddha wrote:
>> >> >
>> >> > - case CPU_DOWN_PREPARE:
>> >> > + case CPU_DYING:
>> >>
>> >> Don't we need to take care of CPU_DYING_FROZEN aswell?
>> >
>> > Well, I'd say we do.
>>
>> Should I add that to the patch as well then?
>
> Yes please.

Here is v2 with the suggested improvement

-Greg

------------------------

keep rd->online and cpu_online_map in sync

It is possible to allow the root-domain cache of online cpus to
become out of sync with the global cpu_online_map. This is because we
currently trigger removal of cpus too early in the notifier chain.
Other DOWN_PREPARE handlers may in fact run and reconfigure the
root-domain topology, thereby stomping on our own offline handling.

The end result is that rd->online may become out of sync with
cpu_online_map, which results in potential task misrouting.

So change the offline handling to be more tightly coupled with the
global offline process by triggering on CPU_DYING intead of
CPU_DOWN_PREPARE.

Signed-off-by: Gregory Haskins <ghas...@novell.com>

Cc: Gautham R Shenoy <e...@in.ibm.com>
Cc: "Siddha, Suresh B" <suresh....@intel.com>
Cc: Ingo Molnar <mi...@elte.hu>
Cc: "Rafael J. Wysocki" <r...@sisk.pl>
Cc: Andrew Morton <ak...@linux-foundation.org>
---

kernel/sched.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 52b9867..1cb53fb 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5881,7 +5881,8 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
spin_unlock_irq(&rq->lock);
break;

- case CPU_DOWN_PREPARE:
+ case CPU_DYING:

+ case CPU_DYING_FROZEN:

/* Update our root-domain */
rq = cpu_rq(cpu);
spin_lock_irqsave(&rq->lock, flags);

--

Andrew Morton

unread,

Mar 10, 2008, 7:40:06 PM3/10/08

to

Does this make
cpu-hotplug-register-update_sched_domains-notifier-with-higher-prio.patch
(below) obsolete, or do we want both?

--- a/kernel/sched.c~cpu-hotplug-register-update_sched_domains-notifier-with-higher-prio
+++ a/kernel/sched.c
@@ -7096,8 +7096,16 @@ void __init sched_init_smp(void)

if (cpus_empty(non_isolated_cpus))
cpu_set(smp_processor_id(), non_isolated_cpus);
put_online_cpus();
- /* XXX: Theoretical race here - CPU may be hotplugged now */
- hotcpu_notifier(update_sched_domains, 0);
+ /*
+ * XXX: Theoretical race here - CPU may be hotplugged now
+ *
+ * We register the notifier with priority 11, which means that
+ * update_sched_domains() will be called just before migration_call().
+ *
+ * This is necessary to ensure that the rt wake up logic works fine
+ * and the rq->rd->online_map remains in sync with the cpu_online_map.
+ */
+ hotcpu_notifier(update_sched_domains, 11);

/* Move init over to a non-isolated CPU */
if (set_cpus_allowed(current, non_isolated_cpus) < 0)

_

Suresh Siddha

unread,

Mar 10, 2008, 9:40:10 PM3/10/08

to

On Mon, Mar 10, 2008 at 04:36:13PM -0700, Andrew Morton wrote:
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -5881,7 +5881,8 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
> > spin_unlock_irq(&rq->lock);
> > break;
> >
> > - case CPU_DOWN_PREPARE:
> > + case CPU_DYING:
> > + case CPU_DYING_FROZEN:
> > /* Update our root-domain */
> > rq = cpu_rq(cpu);
> > spin_lock_irqsave(&rq->lock, flags);
>
> Does this make
> cpu-hotplug-register-update_sched_domains-notifier-with-higher-prio.patch
> (below) obsolete,

Yes. I would like to Ack for the
keep-rd-online-and-cpu_online_map-in-sync.patch, as it is more cleaner.

> or do we want both?

No. I don't think so. Gautham, do you agree?

thanks,
suresh

Gautham R Shenoy

unread,

Mar 11, 2008, 12:50:09 AM3/11/08

to

On Mon, Mar 10, 2008 at 06:34:56PM -0700, Suresh Siddha wrote:
> On Mon, Mar 10, 2008 at 04:36:13PM -0700, Andrew Morton wrote:
> > > --- a/kernel/sched.c
> > > +++ b/kernel/sched.c
> > > @@ -5881,7 +5881,8 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
> > > spin_unlock_irq(&rq->lock);
> > > break;
> > >
> > > - case CPU_DOWN_PREPARE:
> > > + case CPU_DYING:
> > > + case CPU_DYING_FROZEN:
> > > /* Update our root-domain */
> > > rq = cpu_rq(cpu);
> > > spin_lock_irqsave(&rq->lock, flags);
> >
> > Does this make
> > cpu-hotplug-register-update_sched_domains-notifier-with-higher-prio.patch
> > (below) obsolete,
>
> Yes. I would like to Ack for the
> keep-rd-online-and-cpu_online_map-in-sync.patch, as it is more cleaner.
>
>
> > or do we want both?
>
> No. I don't think so. Gautham, do you agree?

Yup I agree! Greg's patch is better since i clears the
bit *just* before we take the cpu down. That way we won't mess with the
ordering.

But I would still prefer that we update the sched_domains early on
after cpu-hotplug rather than doing it after the subsystems have run
their notifiers.

Will send a seperate patch for that one.

>
> thanks,
> suresh
>
> >
> > --- a/kernel/sched.c~cpu-hotplug-register-update_sched_domains-notifier-with-higher-prio
> > +++ a/kernel/sched.c
> > @@ -7096,8 +7096,16 @@ void __init sched_init_smp(void)
> > if (cpus_empty(non_isolated_cpus))
> > cpu_set(smp_processor_id(), non_isolated_cpus);
> > put_online_cpus();
> > - /* XXX: Theoretical race here - CPU may be hotplugged now */
> > - hotcpu_notifier(update_sched_domains, 0);
> > + /*
> > + * XXX: Theoretical race here - CPU may be hotplugged now
> > + *
> > + * We register the notifier with priority 11, which means that
> > + * update_sched_domains() will be called just before migration_call().
> > + *
> > + * This is necessary to ensure that the rt wake up logic works fine
> > + * and the rq->rd->online_map remains in sync with the cpu_online_map.
> > + */
> > + hotcpu_notifier(update_sched_domains, 11);
> >
> > /* Move init over to a non-isolated CPU */
> > if (set_cpus_allowed(current, non_isolated_cpus) < 0)
> > _
> >

--
Thanks and Regards
gautham