The kernel thread [watchdog/1] isn't reaped by its parent when cpu #1
is set offline, so [kstopmachine] sleeps forever.
After investigating, debugging and analyzing it, I think it is a
scheduler-specific issue. I noticed someone submitted a patch which mentioned
the kthread_should_stop() calling rule:
"In order to safely use kthread_stop() for kthread, there is a requirement
on how its main loop has to be organized. Namely, the sequence of
events that lead to kthread being blocked (schedule()) has to be
ordered as follows:
set_current_state(TASK_INTERRUPTIBLE);
if (kthread_should_stop()) break;
schedule() or similar.
set_current_state() implies a full memory barrier. kthread_stop()
has a matching barrier right after an update of kthread_stop_info.k
and before kthread's wakeup."
I don't think it is a good programming paradigm, because it leads to subtle
issues very easily; why can't we provide a simple wrapper for it?
This issue looked like such a case, but I tried changing the code to follow
this rule and the issue is still there.
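For reference, this is roughly what a main loop that follows the quoted rule
looks like (a minimal sketch of the pattern only, not the actual softlockup.c
code):

#include <linux/kthread.h>
#include <linux/sched.h>

static int my_kthread(void *unused)
{
	while (1) {
		/* 1. announce that we are about to sleep (implies a full
		 *    memory barrier) */
		set_current_state(TASK_INTERRUPTIBLE);

		/* 2. only then check the stop flag set by kthread_stop() */
		if (kthread_should_stop())
			break;

		/* 3. sleep; the wake-up from kthread_stop() cannot be lost,
		 *    because steps 1-3 are ordered against its update of the
		 *    stop flag */
		schedule();
	}
	__set_current_state(TASK_RUNNING);
	return 0;
}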
Why isn't the kernel thread [watchdog/1] reaped by its parent? Its state
is TASK_RUNNING with high priority (that is what "R<" means), so why doesn't it finish?
Has anyone met such a problem before? Any thoughts?
Here are my stress test scripts; you may try them on SMP platforms (the more
CPUs there are, the easier it is to reproduce).
########################################################################
####### stresscpus.sh #######
#!/bin/sh
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
do
./bugexposure.sh $i &
done
######## bugexposure.sh ###########
#!/bin/bash
main(){
	typeset -i CPU=$1
	./task.sh > /dev/null &
	PID=$!
	if [ `cat /sys/devices/system/cpu/cpu${CPU}/online` = "0" ]; then
		echo "1" > /sys/devices/system/cpu/cpu${CPU}/online
	fi
	MASK=$((1<<${CPU}))
	taskset -p ${MASK} ${PID} > /dev/null 2>&1
	echo "0" > /sys/devices/system/cpu/cpu${CPU}/online
	echo "1" > /sys/devices/system/cpu/cpu${CPU}/online
	disown $PID
	kill -9 $PID > /dev/null 2>&1
	#echo "PASS\n"
}
typeset -i TEST_CPU=$1
while true
do
main $TEST_CPU
done
###### task.sh ######
#!/bin/bash
while :
do
NOOP=1
done
########################################################################
My tracing and analysis info is below:
# ps aux | grep watchdog
root 5 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/0]
root 8 0.0 0.0 0 0 ? R< 22:26 0:00 [watchdog/1]
root 11 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/2]
root 14 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/3]
root 17 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/4]
root 20 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/5]
root 23 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/6]
root 26 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/7]
root 29 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/8]
root 32 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/9]
root 35 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/10]
root 38 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/11]
root 41 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/12]
root 44 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/13]
root 50 0.0 0.0 0 0 ? S< 22:26 0:00 [watchdog/15]
root 5641 0.0 0.0 61144 708 pts/0 R+ 22:58 0:00 grep watchdog
#
# ps aux | grep ksoftirqd
root 4 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/0]
root 10 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/2]
root 13 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/3]
root 16 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/4]
root 19 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/5]
root 22 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/6]
root 25 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/7]
root 28 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/8]
root 31 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/9]
root 34 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/10]
root 37 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/11]
root 40 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/12]
root 43 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/13]
root 49 0.0 0.0 0 0 ? S< 22:26 0:00 [ksoftirqd/15]
root 5647 0.0 0.0 61144 712 pts/0 R+ 23:00 0:00 grep ksoftirqd
#
#ps aux
...
root 5554 0.0 0.0 0 0 ? S< 22:42 0:00 [kstopmachine]
root 5556 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct>
root 5557 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct>
root 5558 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct>
root 5559 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct>
root 5560 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct>
root 5561 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct>
root 5562 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct>
root 5563 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct>
root 5564 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct>
root 5565 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct>
root 5566 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct>
root 5567 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct>
root 5568 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct>
root 5569 0.0 0.0 0 0 ? Z< 22:42 0:00 [kstopmachine] <defunct>
...
[watchdog/1] isn't terminated, so [kstopmachine] waits for it forever;
as a result "echo 0 > /sys/devices/system/cpu/cpu1/online" hangs, the
hotplug lock stays held by that path, every other contender for the lock
waits forever, and so on...
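For context, the place where the hotplug path gets stuck is the CPU_DEAD
case of cpu_callback() in kernel/softlockup.c, which looks roughly like this
(reconstructed from the patches quoted later in this thread, not the verbatim
source):

	case CPU_DEAD:
	case CPU_DEAD_FROZEN:
		p = per_cpu(watchdog_task, hotcpu);
		per_cpu(watchdog_task, hotcpu) = NULL;
		/* blocks until watchdog/N runs and notices
		 * kthread_should_stop(); if that thread never gets a cpu
		 * again, we sleep here forever while the hotplug locks are
		 * still held */
		kthread_stop(p);
		break;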
Here is some soft-lockup detection output:
process 5471 (task.sh) no longer affine to cpu1
INFO: task group_balance:51 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
group_balance D ffff810001011580 0 51 2
ffff81027e97fdc0 0000000000000046 ffff81027e97fdd0 ffffffff80472b15
0000000000000000 ffff81027e97d280 ffff81027c8bd300 ffff81027e97d5d0
0000000080569da0 ffff81027e97d5d0 000000007e97d280 000000010009d366
Call Trace:
[<ffffffff80472b15>] thread_return+0x3d/0xa2
[<ffffffff8023cce5>] lock_timer_base+0x26/0x4b
[<ffffffff8047338b>] __mutex_lock_slowpath+0x5c/0x93
[<ffffffff80473221>] mutex_lock+0x1a/0x1e
[<ffffffff8024f8fd>] get_online_cpus+0x27/0x3e
[<ffffffff8022fd04>] load_balance_monitor+0x60/0x375
[<ffffffff8022fca4>] load_balance_monitor+0x0/0x375
[<ffffffff80245ff6>] kthread+0x47/0x75
[<ffffffff802301f0>] schedule_tail+0x28/0x5c
[<ffffffff8020cc78>] child_rip+0xa/0x12
[<ffffffff80245faf>] kthread+0x0/0x75
[<ffffffff8020cc6e>] child_rip+0x0/0x12
INFO: task bugexposure.sh:5462 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bugexposure.s D ffff8100010e1580 0 5462 1
ffff81026a981c58 0000000000000082 ffff81026a981c68 ffffffff80472b15
00000000000004d6 ffff81027c83c920 ffff81027e8bd680 ffff81027c83cc70
0000000d7e8bc4c0 ffff81027c83cc70 0000000d00000002 000000010009d809
Call Trace:
[<ffffffff80472b15>] thread_return+0x3d/0xa2
[<ffffffff80314b25>] __next_cpu+0x19/0x28
[<ffffffff8022a5d4>] enqueue_rt_entity+0x4e/0xb1
[<ffffffff80472f34>] schedule_timeout+0x1e/0xad
[<ffffffff802280bf>] enqueue_task+0x4d/0x58
[<ffffffff80472ccf>] wait_for_common+0xd5/0x118
[<ffffffff8022c31b>] default_wake_function+0x0/0xe
[<ffffffff80245c93>] kthread_stop+0x58/0x79
[<ffffffff8046fdcf>] cpu_callback+0x189/0x197
[<ffffffff8046f59e>] cpu_callback+0x1cf/0x274
[<ffffffff8026c07e>] writeback_set_ratelimit+0x19/0x68
[<ffffffff8047658b>] notifier_call_chain+0x29/0x4c
[<ffffffff8024f74d>] _cpu_down+0x1d6/0x2aa
[<ffffffff8024f845>] cpu_down+0x24/0x31
[<ffffffff8038fcf4>] store_online+0x29/0x67
[<ffffffff802d23e1>] sysfs_write_file+0xd2/0x110
[<ffffffff8028db49>] vfs_write+0xad/0x136
[<ffffffff8028e086>] sys_write+0x45/0x6e
[<ffffffff8020bfd9>] tracesys+0xdc/0xe1
> per_cpu(watchdog_task, hotcpu) = NULL;
> + mlseep(1);
that won't build very well ...
Ingo
iirc, Andrew had the same issue.
'watchdog's are supposed to be stopped with kthread_stop() from
softlockup.c :: cpu_callback() :: case CPU_DEAD.
The 'R' state of 'watchdog' is strange indeed.
would you please conduct a test with the patch [1] below and provide
me with the additional output it generates?
(non-white-space-damaged versions are enclosed)
with your current set-up, and also with
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
Then please also try patch [2] (on top of [1]). With this 'magic'
patch applied, the issue 'seemed' to disappear on Andrew's set-up...
Thanks in advance,
[1]
--- softlockup-prev.c 2008-03-03 12:35:01.000000000 +0100
+++ softlockup.c 2008-03-03 12:38:01.000000000 +0100
@@ -237,6 +237,8 @@ static int watchdog(void *__bind_cpu)
}
+ printk(KERN_WARN "-> watchdog(cpu: %d) is done\n", this_cpu);
+
return 0;
}
@@ -249,6 +251,9 @@ cpu_callback(struct notifier_block *nfb,
int hotcpu = (unsigned long)hcpu;
struct task_struct *p;
+ printk(KERN_WARN "-> cpu_callback(cpu: %d, action: %lu, check_cpu: %d)\n",
+ hotcpu, action, check_cpu);
+
switch (action) {
case CPU_UP_PREPARE:
case CPU_UP_PREPARE_FROZEN:
[2]
--- softlockup-prev-2.c 2008-03-03 12:38:36.000000000 +0100
+++ softlockup.c 2008-03-03 12:39:02.000000000 +0100
@@ -294,6 +294,7 @@ cpu_callback(struct notifier_block *nfb,
case CPU_DEAD_FROZEN:
p = per_cpu(watchdog_task, hotcpu);
per_cpu(watchdog_task, hotcpu) = NULL;
+ mlseep(1);
kthread_stop(p);
break;
#endif /* CONFIG_HOTPLUG_CPU */
--
Best regards,
Dmitry Adamushko
yeah, I forgot to mention that it's not even compile-tested :-/
I re-created it from scratch instead of looking for the original one.
please, this one (again, not compile-tested)
--- softlockup-prev-2.c 2008-03-03 12:38:36.000000000 +0100
+++ softlockup.c 2008-03-03 13:00:20.000000000 +0100
@@ -294,6 +294,7 @@ cpu_callback(struct notifier_block *nfb,
case CPU_DEAD_FROZEN:
p = per_cpu(watchdog_task, hotcpu);
per_cpu(watchdog_task, hotcpu) = NULL;
+ msleep(1);
Hi Yi,
This is indeed strange. I am able to reproduce this problem on my 4-way
box. From what I see in the past two runs, we're waiting in the
cpu-hotplug callback path for the watchdog/1 thread to stop.
During cpu-offline, once the cpu goes offline, migration_call() migrates
any tasks associated with the offline cpu to some other cpu. This also
means breaking affinity for tasks which were affined to the cpu which went
down. So watchdog/1 has been migrated to some other cpu.
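(Roughly, the migration step in question looks like the sketch below. This is
reconstructed from memory of the 2.6.25-era kernel/sched.c, so treat the exact
names and details as an assumption; the "no longer affine to cpuN" message seen
earlier in this thread comes from this path.)

static void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p)
{
	int dest_cpu = any_online_cpu(p->cpus_allowed);

	if (dest_cpu >= NR_CPUS) {
		/* no online cpu is left in the task's mask: break affinity
		 * and pick any online cpu */
		cpus_setall(p->cpus_allowed);
		dest_cpu = any_online_cpu(p->cpus_allowed);
		printk(KERN_INFO "process %d (%s) no longer affine to cpu%d\n",
		       p->pid, p->comm, dead_cpu);
	}
	__migrate_task(p, dead_cpu, dest_cpu);
}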
However, it remains in R< state and has not executed the
kthread_should_stop() instruction.
I'm trying to probe further by inserting a few more printk's in there.
Will post the findings in a couple of hours.
Thanks for reporting the problem.
Regards
gautham.
Maybe migration_call was doing such a bad thing. :-)
>
> However, it remains in R< state and has not executed the
> kthread_should_stop() instruction.
>
> I'm trying to probe further by inserting a few more printk's in there.
>
> Will post the findings in a couple of hours.
>
> Thanks for reporting the problem.
>
> Regards
> gautham.
I don't think it can fix this issue; it only gives the scheduler one more
chance. I think there is another potential and very serious issue inside the
scheduler, the locking, or something else we don't yet know about.
Maybe migration is a suspect, as Gautham mentioned.
Yes, it is killed once it gets a chance to run *after* the cpu goes offline.
The moment it runs on some other cpu, it will see kthread_should_stop(),
because in the cpu-hotplug callback path we've issued a
kthread_stop(watchdog/1).
Again, we can argue that we could issue the kthread_stop()
in CPU_DOWN_PREPARE rather than in CPU_DEAD, and restart
the thread in CPU_DOWN_FAILED if the cpu-hotplug operation does fail.
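Something along these lines, purely as a hypothetical sketch of that
alternative arrangement (untested, and not what the current code does):

	case CPU_DOWN_PREPARE:
	case CPU_DOWN_PREPARE_FROZEN:
		/* stop the watchdog while its cpu is still online, so it can
		 * run and see kthread_should_stop() on its own cpu */
		p = per_cpu(watchdog_task, hotcpu);
		per_cpu(watchdog_task, hotcpu) = NULL;
		if (p)
			kthread_stop(p);
		break;

	case CPU_DOWN_FAILED:
	case CPU_DOWN_FAILED_FROZEN:
		/* the offline attempt failed: recreate and rebind the
		 * watchdog thread, as in the CPU_UP_PREPARE/CPU_ONLINE
		 * cases */
		break;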
>
> Maybe migration_call was doing such a bad thing. :-)
Nope, from what I see migration_call() is not having any problems. It is
behaving the way it is supposed to behave :)
The other observation I noted was the WARN_ON_ONCE() in hrtick() [1]
that I am consistently hitting after the first cpu goes offline.
So at times the callback thread is blocked on kthread_stop(k) in
softlockup.c, while at other times it was blocked in
cleanup_workqueue_thread() in workqueue.c.
This was with the debug patch [2].
Not sure if this is linked to the problem that Yi has pointed out,
but it looks like a regression. I'll see if this can be reproduced on
2.6.24, 2.6.25-rc1 and 2.6.25-rc2.
[1] The WARN_ON_ONCE() trace.
------------[ cut here ]------------
WARNING: at kernel/sched.c:1007 hrtick+0x32/0x6a()
Modules linked in: dock
Pid: 4451, comm: bash Not tainted 2.6.25-rc3 #26
[<c011f6c8>] warn_on_slowpath+0x41/0x51
[<c013a0dd>] ? trace_hardirqs_on+0xd3/0x111
[<c04e43a3>] ? _spin_unlock_irqrestore+0x42/0x58
[<c02767e7>] ? blk_run_queue+0x64/0x68
[<c033ae6e>] ? scsi_run_queue+0x18d/0x195
[<c027fd7b>] ? kobject_put+0x14/0x16
[<c02e1c3f>] ? put_device+0x11/0x13
[<c013af7b>] ? __lock_acquire+0xaae/0xaf6
[<c01320bf>] ? __run_hrtimer+0x35/0x70
[<c0119a8a>] hrtick+0x32/0x6a
[<c0119a58>] ? hrtick+0x0/0x6a
[<c01320c3>] __run_hrtimer+0x39/0x70
[<c01328e8>] hrtimer_interrupt+0xed/0x156
[<c0112db9>] smp_apic_timer_interrupt+0x6c/0x7f
[<c010568b>] apic_timer_interrupt+0x33/0x38
[<c01202ed>] ? vprintk+0x2d0/0x328
[<c027fe47>] ? kobject_release+0x4b/0x50
[<c027fdfc>] ? kobject_release+0x0/0x50
[<c04dea24>] ? cpuid_class_cpu_callback+0x0/0x50
[<c0280931>] ? kref_put+0x39/0x44
[<c027fd7b>] ? kobject_put+0x14/0x16
[<c02e1c3f>] ? put_device+0x11/0x13
[<c014e2e3>] ? cpu_swap_callback+0x0/0x3d
[<c012035a>] printk+0x15/0x17
[<c04e61d0>] notifier_call_chain+0x40/0x9b
[<c04e2d25>] ? mutex_unlock+0x8/0xa
[<c0143c29>] ? __stop_machine_run+0x8c/0x95
[<c013e1b3>] ? take_cpu_down+0x0/0x27
[<c01331e8>] __raw_notifier_call_chain+0xe/0x10
[<c01331f6>] raw_notifier_call_chain+0xc/0xe
[<c013e37e>] _cpu_down+0x1a4/0x269
[<c013e466>] cpu_down+0x23/0x30
[<c02e58e7>] store_online+0x27/0x5a
[<c02e58c0>] ? store_online+0x0/0x5a
[<c02e2a9c>] sysdev_store+0x20/0x25
[<c0197e65>] sysfs_write_file+0xad/0xdf
[<c0197db8>] ? sysfs_write_file+0x0/0xdf
[<c0165099>] vfs_write+0x8c/0x108
[<c0165623>] sys_write+0x3b/0x60
[<c0104b12>] sysenter_past_esp+0x5f/0xa5
=======================
---[ end trace 22cbd9e369049151 ]---
[2] The debug patch
----->
Index: linux-2.6.25-rc3/kernel/cpu.c
===================================================================
--- linux-2.6.25-rc3.orig/kernel/cpu.c
+++ linux-2.6.25-rc3/kernel/cpu.c
@@ -18,7 +18,7 @@
/* Serializes the updates to cpu_online_map, cpu_present_map */
static DEFINE_MUTEX(cpu_add_remove_lock);
-static __cpuinitdata RAW_NOTIFIER_HEAD(cpu_chain);
+__cpuinitdata RAW_NOTIFIER_HEAD(cpu_chain);
/* If set, cpu_up and cpu_down will return -EBUSY and do nothing.
* Should always be manipulated under cpu_add_remove_lock
@@ -207,11 +207,14 @@ static int _cpu_down(unsigned int cpu, i
if (!cpu_online(cpu))
return -EINVAL;
+ printk("[HOTPLUG] calling cpu_hotplug_begin\n");
cpu_hotplug_begin();
+ printk("[HOTPLUG] calling CPU_DOWN_PREPARE\n");
err = __raw_notifier_call_chain(&cpu_chain, CPU_DOWN_PREPARE | mod,
hcpu, -1, &nr_calls);
if (err == NOTIFY_BAD) {
nr_calls--;
+ printk("[HOTPLUG] calling CPU_DOWN_FAILED\n");
__raw_notifier_call_chain(&cpu_chain, CPU_DOWN_FAILED | mod,
hcpu, nr_calls, NULL);
printk("%s: attempt to take down CPU %u failed\n",
@@ -226,10 +229,12 @@ static int _cpu_down(unsigned int cpu, i
cpu_clear(cpu, tmp);
set_cpus_allowed(current, tmp);
+ printk("[HOTPLUG] calling stop_machine_run()\n");
p = __stop_machine_run(take_cpu_down, &tcd_param, cpu);
if (IS_ERR(p) || cpu_online(cpu)) {
/* CPU didn't die: tell everyone. Can't complain. */
+ printk("[HOTPLUG] calling CPU_DOWN_FAILED\n");
if (raw_notifier_call_chain(&cpu_chain, CPU_DOWN_FAILED | mod,
hcpu) == NOTIFY_BAD)
BUG();
@@ -241,13 +246,16 @@ static int _cpu_down(unsigned int cpu, i
goto out_thread;
}
+ printk("[HOTPLUG] waiting for idle_cpu()\n");
/* Wait for it to sleep (leaving idle task). */
while (!idle_cpu(cpu))
yield();
+ printk("[HOTPLUG] calling __cpu_die()\n");
/* This actually kills the CPU. */
__cpu_die(cpu);
+ printk("[HOTPLUG] calling CPU_DEAD\n");
/* CPU is completely dead: tell everyone. Too late to complain. */
if (raw_notifier_call_chain(&cpu_chain, CPU_DEAD | mod,
hcpu) == NOTIFY_BAD)
@@ -256,11 +264,14 @@ static int _cpu_down(unsigned int cpu, i
check_for_tasks(cpu);
out_thread:
+ printk("[HOTPLUG] calling kthread_stop_machine\n");
err = kthread_stop(p);
out_allowed:
set_cpus_allowed(current, old_allowed);
out_release:
+ printk("[HOTPLUG] calling cpu_hotplug_done()\n");
cpu_hotplug_done();
+ printk("[HOTPLUG] returning from _cpu_down()\n");
return err;
}
Index: linux-2.6.25-rc3/kernel/notifier.c
===================================================================
--- linux-2.6.25-rc3.orig/kernel/notifier.c
+++ linux-2.6.25-rc3/kernel/notifier.c
@@ -5,7 +5,7 @@
#include <linux/rcupdate.h>
#include <linux/vmalloc.h>
#include <linux/reboot.h>
-
+#include <linux/kallsyms.h>
/*
* Notifier list for kernel code which wants to be called
* at shutdown. This is used to stop any idling DMA operations
@@ -44,6 +44,7 @@ static int notifier_chain_unregister(str
return -ENOENT;
}
+extern struct raw_notifier_head cpu_chain;
/**
* notifier_call_chain - Informs the registered notifiers about an event.
* @nl: Pointer to head of the blocking notifier chain
@@ -62,12 +63,21 @@ static int __kprobes notifier_call_chain
{
int ret = NOTIFY_DONE;
struct notifier_block *nb, *next_nb;
+ char name_buf[100];
nb = rcu_dereference(*nl);
while (nb && nr_to_call) {
next_nb = rcu_dereference(nb->next);
+ if (nl == &cpu_chain.head) {
+ sprint_symbol(name_buf, (unsigned long)nb->notifier_call);
+ printk("[HOTPLUG] calling callback:%s\n", name_buf);
+ }
ret = nb->notifier_call(nb, val, v);
+ if (nl == &cpu_chain.head) {
+ sprint_symbol(name_buf, (unsigned long)nb->notifier_call);
+ printk("[HOTPLUG] returned from callback:%s\n", name_buf);
+ }
if (nr_calls)
(*nr_calls)++;
--
Thanks and Regards
gautham
That issue is still there after the above patch is applied.
I found that [watchdog/#] is indeed migrated to another cpu because
migration_call() is called before cpu_callback(); I think this is very
probably the real root cause.
I suggest we develop a new notifier infrastructure in which a caller
can specify whether it is kthread_stop()ing a cpu-bound kthread, so
that such notifier callbacks can be executed before the other
callbacks.
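For what it's worth, the existing notifier API already allows some ordering:
a notifier_block carries a priority field, and callbacks registered with a
higher priority are called earlier in the chain. A rough sketch (the priority
value here is an arbitrary assumption, not a recommendation):

#include <linux/cpu.h>
#include <linux/notifier.h>

static int __cpuinit my_cpu_callback(struct notifier_block *nfb,
				     unsigned long action, void *hcpu)
{
	/* handle CPU_DOWN_PREPARE / CPU_DEAD etc. here */
	return NOTIFY_OK;
}

static struct notifier_block __cpuinitdata my_cpu_notifier = {
	.notifier_call	= my_cpu_callback,
	.priority	= 20,	/* run before lower-priority callbacks */
};

/* registered with register_cpu_notifier(&my_cpu_notifier); */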
This is the hung_task_timeout message after a couple of cpu-offlines.
This is on 2.6.25-rc3.
INFO: task bash:4467 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
f3701dd0 00000046 f796aac0 f796aac0 f796abf8 cc434b80 00000000 f41ee940
0180b046 0000026e 00000016 00000000 00000008 f796b080 f796aac0 00000002
7fffffff 7fffffff f3701e1c f3701df8 c04e033a f3701e1c f3701dec c0139dec
Call Trace:
[<c04e033a>] schedule_timeout+0x16/0x8b
[<c0139dec>] ? trace_hardirqs_on+0xe9/0x111
[<c04e01c9>] wait_for_common+0xcf/0x12e
[<c011a3f0>] ? default_wake_function+0x0/0xd
[<c04e02aa>] wait_for_completion+0x12/0x14
[<c012ccbb>] flush_cpu_workqueue+0x50/0x66
[<c012cd28>] ? wq_barrier_func+0x0/0xd
[<c012cd14>] cleanup_workqueue_thread+0x43/0x57
[<c04c6f87>] workqueue_cpu_callback+0x8e/0xbd
[<c04e3975>] notifier_call_chain+0x2b/0x4a
[<c0132e9d>] __raw_notifier_call_chain+0xe/0x10
[<c0132eab>] raw_notifier_call_chain+0xc/0xe
[<c013e054>] _cpu_down+0x150/0x1ec
[<c013e133>] cpu_down+0x23/0x30
[<c02e3897>] store_online+0x27/0x5a
[<c02e3870>] ? store_online+0x0/0x5a
[<c02e09d8>] sysdev_store+0x20/0x25
[<c0196d2d>] sysfs_write_file+0xad/0xdf
[<c0196c80>] ? sysfs_write_file+0x0/0xdf
[<c0163da9>] vfs_write+0x8c/0x108
[<c0164333>] sys_write+0x3b/0x60
[<c01049da>] sysenter_past_esp+0x5f/0xa5
=======================
3 locks held by bash/4467:
#0: (&buffer->mutex){--..}, at: [<c0196ca5>] sysfs_write_file+0x25/0xdf
#1: (cpu_add_remove_lock){--..}, at: [<c013e10e>] cpu_maps_update_begin+0xf/0x11
#2: (cpu_hotplug_lock){----}, at: [<c013df5b>] _cpu_down+0x57/0x1ec
So it's not just an issue of the watchdog thread not being reaped.
I doubt it's due to some locking dependency, since we have lockdep checks
in the workqueue code before we flush the cpu_workqueue.
On my machine, that indicated [watchdog/1] is calling
sched_setscheduler(). I suspect it is being killed before it has started and
been woken up; this may result in some synchronization issues.
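For reference, that call sits right at the top of the watchdog thread
function, roughly as below (this matches the context visible in the patch
hunks further down the thread; the main loop is omitted here):

static int watchdog(void *__bind_cpu)
{
	struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
	int this_cpu = (long)__bind_cpu;

	/* the thread promotes itself to SCHED_FIFO only once it actually
	 * gets to run; if it is stopped before it ever ran, this has not
	 * happened yet */
	sched_setscheduler(current, SCHED_FIFO, &param);

	/* initialize timestamp */
	touch_softlockup_watchdog();

	/* ... per-cpu soft-lockup checking loop ... */

	return 0;
}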
From another message:
>
> However, it remains in R< state
What about cwq->thread? Was it TASK_RUNNING too?
Perhaps, for some reason, the task can't get a CPU after migrating from
the now-dead CPU.
I can't reproduce this problem on my single-cpu P4-HT; perhaps you can
try something like the untested/uncompiled patch below?
Oleg.
--- include/linux/sched.h 2008-02-15 16:59:17.000000000 +0300
+++ include/linux/sched.h 2008-03-04 17:44:53.136738605 +0300
@@ -1121,6 +1121,7 @@ struct task_struct {
/* hung task detection */
unsigned long last_switch_timestamp;
unsigned long last_switch_count;
+ unsigned long xxx;
#endif
/* CPU-specific state of this task */
struct thread_struct thread;
--- kernel/fork.c 2008-02-15 16:59:17.000000000 +0300
+++ kernel/fork.c 2008-03-04 17:45:14.773033839 +0300
@@ -1097,6 +1097,7 @@ static struct task_struct *copy_process(
#ifdef CONFIG_DETECT_SOFTLOCKUP
p->last_switch_count = 0;
p->last_switch_timestamp = 0;
+ p->xxx = 0;
#endif
#ifdef CONFIG_TASK_XACCT
--- kernel/sched.c 2008-02-15 16:59:17.000000000 +0300
+++ kernel/sched.c 2008-03-04 17:48:42.308798646 +0300
@@ -1291,6 +1291,7 @@ static void enqueue_task(struct rq *rq,
sched_info_queued(p);
p->sched_class->enqueue_task(rq, p, wakeup);
p->se.on_rq = 1;
+ p->xxx = jiffies | 1;
}
static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
@@ -3944,6 +3945,8 @@ need_resched_nonpreemptible:
preempt_enable_no_resched();
if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
goto need_resched;
+
+ current->xxx = 0;
}
EXPORT_SYMBOL(schedule);
--- kernel/softlockup.c 2008-02-15 16:59:17.000000000 +0300
+++ kernel/softlockup.c 2008-03-04 17:49:05.584414763 +0300
@@ -174,6 +174,27 @@ static void check_hung_task(struct task_
touch_nmi_watchdog();
}
+static void check_running_task(struct task_struct *t, unsigned long now)
+{
+ if (!sysctl_hung_task_timeout_secs)
+ return;
+
+ if (time_before(now, t->xxx + HZ * sysctl_hung_task_timeout_secs)
+ return;
+
+ printk(KERN_ERR "INFO: task %s:%d can't get CPU for more than "
+ "%ld seconds.\n", t->comm, t->pid,
+ sysctl_hung_task_timeout_secs);
+
+ if (!cpus_intersects(t->cpus_allowed, cpu_online_map))
+ printk(KERN_ERR "bad ->cpus_allowed\n");
+ if (!cpu_online(task_cpu(t)))
+ printk(KERN_ERR "bad ->cpu\n");
+
+ sched_show_task(t);
+ touch_nmi_watchdog();
+}
+
/*
* Check whether a TASK_UNINTERRUPTIBLE does not get woken up for
* a really long time (120 seconds). If that happens, print out
@@ -183,6 +204,7 @@ static void check_hung_uninterruptible_t
{
int max_count = sysctl_hung_task_check_count;
unsigned long now = get_timestamp(this_cpu);
+ unsigned long jiff = jiffies;
struct task_struct *g, *t;
/*
@@ -192,15 +214,17 @@ static void check_hung_uninterruptible_t
if ((tainted & TAINT_DIE) || did_panic)
return;
- read_lock(&tasklist_lock);
+ rcu_read_lock();
do_each_thread(g, t) {
if (!--max_count)
goto unlock;
if (t->state & TASK_UNINTERRUPTIBLE)
check_hung_task(t, now);
+ if (!t->xxx)
+ check_running_task(t, jiff);
} while_each_thread(g, t);
unlock:
- read_unlock(&tasklist_lock);
+ rcu_read_unlock();
}
/*
At the beginning, I thought the soft-lockup detector
(CONFIG_DETECT_SOFTLOCKUP) was the culprit, because it adds a lot of code to
detect lock dependencies and deadlocks, but the issue is still there after I
disabled soft-lockup detection; it can still be reproduced, it just takes
much longer.
At this time, group_balance and kblockd are in "R<" state.
[root@harwich-rh ~]# ps aux | grep kstop
root 15466 0.0 0.0 0 0 ? S< 01:03 0:00 [kstopmachine]
root 15467 0.0 0.0 0 0 ? Z< 01:03 0:00 [kstopmachine] <defunct>
root 15468 0.0 0.0 0 0 ? Z< 01:03 0:00 [kstopmachine] <defunct>
root 15469 0.0 0.0 0 0 ? Z< 01:03 0:00 [kstopmachine] <defunct>
root 15470 0.0 0.0 0 0 ? Z< 01:03 0:00 [kstopmachine] <defunct>
root 15471 0.0 0.0 0 0 ? Z< 01:03 0:00 [kstopmachine] <defunct>
root 15472 0.0 0.0 0 0 ? Z< 01:03 0:00 [kstopmachine] <defunct>
root 15473 0.0 0.0 0 0 ? Z< 01:03 0:00 [kstopmachine] <defunct>
root 15474 0.0 0.0 0 0 ? Z< 01:03 0:00 [kstopmachine] <defunct>
root 15475 0.0 0.0 0 0 ? Z< 01:03 0:00 [kstopmachine] <defunct>
root 15476 0.0 0.0 0 0 ? Z< 01:03 0:00 [kstopmachine] <defunct>
root 17155 0.0 0.0 61148 708 pts/3 R+ 14:47 0:00 grep kstop
[root@harwich-rh ~]# ps aux | grep "R<"
root 35 0.0 0.0 0 0 ? R< Mar13 0:00 [group_balance]
root 182 0.1 0.0 0 0 ? R< Mar13 1:09 [kblockd/0]
root 17157 0.0 0.0 61148 704 pts/3 R+ 14:47 0:00 grep R<
[root@harwich-rh ~]#
No, it was in TASK_UNINTERRUPTIBLE state. The last thing it ever
executed was the wait_for_completion in flush_cpu_workqueue()
--
Thanks and Regards
gautham
You misunderstood ;) The task doing _cpu_down() hangs in flush_cpu_workqueue(),
because cwq->thread doesn't want to die. But what was cwq->thread->state?
Could you try the patch? Yi constantly sees the tasks in R< state.
I am starting to suspect some scheduling issue.
Oleg.
Yup, sorry for that stupid reply.
> Could you try the patch? Yi constantly sees the tasks in R< state.
> I am starting to suspect some scheduling issue.
Me too. With your patch applied, there were quite a few tasks in the
running state which didn't get the cpu for more than 120 seconds.
It could be a scheduler problem which cpu-hotplug is able to catch more
easily. I will try a git bisect.
>
> Oleg.
--
Thanks and Regards
gautham
BTW, with your patch, I get the "task can't get CPU for more than
120 seconds" message even while running a kernbench on 2.6.25-rc2.
This function gets called only when t->xxx == 0,
so the if below doesn't mean much, does it? :)
--
Thanks and Regards
gautham
Of course, the check above should be
	if (t->xxx)
		check_running_task(t, jiff);
Thanks!
From another message,
>
> Me too. With your patch applied there were quite a few tasks in the
> running state which didn't get the cpu for more than 120 seconds.
(I assume you fixed the patch before using it ;)
Just to be sure, there were no "bad ->cpu..." messages, yes?
Oleg.
What does priority 0 mean? When, and in which cases, is a task's priority
changed to 0? When can its priority be restored to a higher value?
A simple script to get priority 0:
===================================================================
#!/bin/sh
for file in `ls /proc/*/sched`
do
prio=`grep prio $file | awk '{print $3;}'`
if [ "X$prio" == "X0" ] ; then
echo "`head -n 1 $file | awk '{print $1}'`: $prio"
fi
done
==================================================================
[root@harwich-rh ~]# ./lsprio
migration/3: 0
migration/4: 0
migration/5: 0
migration/6: 0
migration/7: 0
migration/8: 0
migration/9: 0
migration/10: 0
migration/11: 0
migration/12: 0
migration/0: 0
migration/13: 0
migration/14: 0
migration/15: 0
grep: /proc/5244/sched: No such file or directory
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
watchdog/1: 0
migration/2: 0
[root@harwich-rh ~]# ps aux | grep "R<"
root 8 0.0 0.0 0 0 ? R< 19:29 0:00
[watchdog/1]
root 6987 0.0 0.0 61144 708 pts/2 R+ 20:16 0:00 grep R<
[root@harwich-rh ~]#
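(For reference on the priority-0 question above: the prio field in
/proc/<pid>/sched is the kernel-internal priority, and for real-time tasks it
is derived roughly as in the sketch below, reconstructed from memory of
kernel/sched.c, so treat the details as an assumption. A SCHED_FIFO task
created with sched_priority = MAX_RT_PRIO-1, as the watchdog and kstopmachine
threads are, therefore shows up with prio 0, the highest priority in the
system; it only goes back to a normal value (120 for a nice-0 task) if its
policy is set back to SCHED_NORMAL.)

static inline int normal_prio(struct task_struct *p)
{
	/* rt_priority 1..99 maps to prio 98..0; a lower prio value means a
	 * higher priority, so sched_priority MAX_RT_PRIO-1 (99) gives prio 0 */
	if (task_has_rt_policy(p))
		return MAX_RT_PRIO - 1 - p->rt_priority;

	/* SCHED_NORMAL/SCHED_BATCH tasks live in the 100..139 range,
	 * 120 by default */
	return p->static_prio;
}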
--
>
> Just to be sure, there were no "bad ->cpu..." messages, yes?
I should hopefully be able to catch them now. If yes, it's a problem in
the way we do migration during cpu-hotplug, as Yi suggested in an earlier
mail.
http://lkml.org/lkml/2008/3/6/437
This mail from akpm says the same thing.
>
> Oleg.
--
Thanks and Regards
gautham
Yup! There are quite a few "bad ->cpu" messages.
All of them only for watchdog/1.
What surprises me is the fact that the first of the task-hung messages comes
after 136 successful cpu-hotplug attempts.
To answer Andrew's question, migration_call() *should* ideally be the
first notifier called, because it's registered with priority 10.
Unless some change has recently gone into the notifier_call_chain
subsystem that no longer honours the registration priority,
migration_call() will be called first,
which means the task affinity for the watchdog thread is broken and hence
we shouldn't see the message below.
I'll probe to see if there's a change in the call order that's causing
this hang.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
INFO: task watchdog/1:10958 can't get CPU for more than 120 seconds.
bad ->cpu
f524ffac 00000046 c011befc f5276aa0 f5276bd8 c443db80 00000001 f6130b80
f524ff94 00000246 c07130f4 03d28000 00000000 f524ff9c c01438ad f524ffac
00000001 c0143a1f 00000000 f524ffd0 c0143a66 00000001 c0119fea 00000000
Call Trace:
[<c011befc>] ? cpu_clock+0x4e/0x59
[<c01438ad>] ? get_timestamp+0x8/0x11
[<c0143a1f>] ? watchdog+0x0/0x239
[<c0143a66>] watchdog+0x47/0x239
[<c0119fea>] ? complete+0x34/0x3e
[<c0143a1f>] ? watchdog+0x0/0x239
[<c012f797>] kthread+0x3b/0x64
[<c012f75c>] ? kthread+0x0/0x64
[<c01056eb>] kernel_thread_helper+0x7/0x10
=======================
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
'watchdog' is of SCHED_FIFO class. The standard load-balancer doesn't
move RT tasks between cpus anymore; there is a special mechanism in
sched_rt.c instead (I think it's .25 material).
So I wonder whether __migrate_task() is still capable of properly
moving an RT task to another CPU (e.g. for the case when it's in
TASK_RUNNING state) without breaking something in the rt migration
mechanism (or whatever else) that would leave us with a runqueue in
an 'inconsistent' state...
(I've only taken a quick look at the relevant code, so I can't confirm it yet)
maybe it'd be faster if somebody could do a quick test now with the
following line commented out in kernel/softlockup.c :: watchdog()
- sched_setscheduler(current, SCHED_FIFO, &param);
--
Best regards,
Dmitry Adamushko
Commenting out that line seems to work. It passed 500 iterations of
cpu-hotplug without any problems.
>
>
> --
> Best regards,
> Dmitry Adamushko
--
Thanks and Regards
gautham
Well, there is another
sched_setscheduler(p, SCHED_FIFO, &param);
in kernel/stop_machine.c
Perhaps we need to remove this as well and try?
thanks,
suresh
> Hi,
>
> 'watchdog' is of SCHED_FIFO class. The standard load-balancer doesn't
> move RT tasks between cpus anymore and there is a special mechanism in
> sched_rt.c instead (I think, it's .25 material).
>
> So I wonder, whether __migrate_task() is still capable of properly
> moving a RT task to another CPU (e.g. for the case when it's in
> TASK_RUNNING state) without breaking something in the rt migration
> mechanism (or whatever else) that would leave us with a runqueue in
> the 'inconsistent' state...
> (I've taken a quick look at the relevant code so can't confirm it yet)
>
> maybe it'd be faster if somebody could do a quick test now with the
> following line commented out in kernel/softlockup.c :: watchdog()
>
> - sched_setscheduler(current, SCHED_FIFO, &param);
>
Yup, thanks. This:
kernel/softirq.c | 2 +-
kernel/softlockup.c | 2 +-
kernel/stop_machine.c | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)
diff -puN kernel/softlockup.c~a kernel/softlockup.c
--- a/kernel/softlockup.c~a
+++ a/kernel/softlockup.c
@@ -211,7 +211,7 @@ static int watchdog(void *__bind_cpu)
struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
int this_cpu = (long)__bind_cpu;
- sched_setscheduler(current, SCHED_FIFO, &param);
+// sched_setscheduler(current, SCHED_FIFO, &param);
/* initialize timestamp */
touch_softlockup_watchdog();
diff -puN kernel/stop_machine.c~a kernel/stop_machine.c
--- a/kernel/stop_machine.c~a
+++ a/kernel/stop_machine.c
@@ -188,7 +188,7 @@ struct task_struct *__stop_machine_run(i
struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
/* One high-prio thread per cpu. We'll do this one. */
- sched_setscheduler(p, SCHED_FIFO, &param);
+// sched_setscheduler(p, SCHED_FIFO, &param);
kthread_bind(p, cpu);
wake_up_process(p);
wait_for_completion(&smdata.done);
diff -puN kernel/softirq.c~a kernel/softirq.c
--- a/kernel/softirq.c~a
+++ a/kernel/softirq.c
@@ -622,7 +622,7 @@ static int __cpuinit cpu_callback(struct
p = per_cpu(ksoftirqd, hotcpu);
per_cpu(ksoftirqd, hotcpu) = NULL;
- sched_setscheduler(p, SCHED_FIFO, &param);
+// sched_setscheduler(p, SCHED_FIFO, &param);
kthread_stop(p);
takeover_tasklets(hotcpu);
break;
_
fixes the won't-power-off regression.
But 2.6.24 runs the watchdog threads SCHED_FIFO too. Are you saying that
it's the migration code which changed?
Well, that would be a problem for suspend/hibernation if there were SCHED_FIFO
non-freezable tasks bound to the nonboot CPUs. I'm not aware of any, but ...
Also, it should be possible to just offline one or more CPUs using the sysfs
interface at any time, and what happens if there are any RT tasks bound to those
CPUs at that time? That would be the $subject problem, IMHO.
Anyone who made the change affecting __migrate_task() so it's unable to migrate
RT tasks any more should have fixed the CPU hotplug as well.
No. It's not an issue with __migrate_task(). The appended patch fixes my issue.
Recent RT wakeup-balance code changes exposed a bug in the migration_call()
code path.
Andrew, please check if the appended patch fixes your power-off problem as well.
thanks,
suresh
---
Handle the `CPU_DOWN_PREPARE_FROZEN' case in migration_call().
Without this, we don't clear the cpu going down in the root domain's
"online" mask. This was causing RT tasks to be woken up on already dead
cpus, causing system hangs during standby, shutdown etc.
For example, on my system, this is the failing sequence:
  kthread_stop()                       // coming from the cpu_callback's
    wake_up_process()
      sched_class->select_task_rq()    // select_task_rq_rt
        find_lowest_rq()
          find_lowest_cpus()
            cpus_and(*lowest_mask, task_rq(task)->rd->online, task->cpus_allowed);
In my case the task's ->cpus_allowed is set to cpu_possible_map and, because of
this bug, rd->online still shows the dead cpu. This results in
find_lowest_rq() returning an offlined cpu, because of which the RT task gets
woken up on a DEAD cpu, causing various hangs.
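To illustrate the step that goes wrong (a simplified sketch of the mask
computation quoted in the sequence above, not the verbatim 2.6.25 source):

	cpumask_t lowest_mask;

	/* candidate cpus for the woken RT task = the root domain's online
	 * cpus AND the task's affinity mask */
	cpus_and(lowest_mask, task_rq(task)->rd->online, task->cpus_allowed);

	/* because the _FROZEN down-prepare event was not handled,
	 * rd->online still has the dead cpu's bit set, so the wakeup can
	 * pick that dead cpu and the task is queued on a runqueue that
	 * will never run it */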
This issue doesn't happen with normal tasks because select_task_rq_fair()
chooses between only two cpus (the cpu which is waking up the task, or the
task's last-run cpu (task_cpu())); the kernel hotplug code makes sure that
both of these always represent online CPUs.
Why didn't it show up in 2.6.24? Because the new wakeup code uses more
complex CPU-selection logic in select_task_rq_rt(), which exposed this
bug in migration_call().
Signed-off-by: Suresh Siddha <suresh....@intel.com>
---
diff --git a/kernel/sched.c b/kernel/sched.c
index 52b9867..60550d8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5882,6 +5882,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
break;
case CPU_DOWN_PREPARE:
+ case CPU_DOWN_PREPARE_FROZEN:
/* Update our root-domain */
rq = cpu_rq(cpu);
spin_lock_irqsave(&rq->lock, flags);