
[3.6.6] panic on reboot / khungtaskd blocked? (WARNING: at arch/x86/kernel/smp.c:123 native_smp_send_reschedule)


Paweł Sikora

Nov 9, 2012, 8:48:23 AM
to linux-...@vger.kernel.org, sta...@vger.kernel.org, torv...@linux-foundation.org, ar...@pld-linux.org, bag...@pld-linux.org
Hi,

while playing with a new UPS i've caught a nice oops on reboot:

http://imgbin.org/index.php?page=image&id=10253

probably the upstream is also affected.

BR,
Paweł.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Michael Wang

Nov 11, 2012, 10:04:39 PM
to Paweł Sikora, linux-...@vger.kernel.org, sta...@vger.kernel.org, torv...@linux-foundation.org, ar...@pld-linux.org, bag...@pld-linux.org
Hi, Paweł

Are you using a clean 3.6.6 without any modifications?

It looks like some thread has set itself to UNINTERRUPTIBLE without any
way to switch itself back later (or only after too long a time). Are you
accidentally using some badly designed module?

BTW, it's better to paste the whole log into the mail as text, not as a picture.

Regards,
Michael Wang

Paweł Sikora

Nov 12, 2012, 2:16:43 AM
to Michael Wang, linux-...@vger.kernel.org, sta...@vger.kernel.org, torv...@linux-foundation.org, ar...@pld-linux.org, bag...@pld-linux.org
On Monday 12 of November 2012 11:04:12 Michael Wang wrote:
> Are you using a clean 3.6.6 without any modifications?

yes, pure 3.6.6 from the git tree with a modular config.

> It looks like some thread has set itself to UNINTERRUPTIBLE without any
> way to switch itself back later (or only after too long a time). Are you
> accidentally using some badly designed module?

hmm, hard to say. almost all modules are loaded automatically by the kernel.

# lsmod
Module Size Used by
nfsv4 125022 1
fuse 77993 0
nfsv3 34273 1
nfs 142878 5 nfsv4,nfsv3
fscache 46225 1 nfs
nfsd 247413 13
lockd 76458 3 nfsv3,nfs,nfsd
nfs_acl 12741 2 nfsv3,nfsd
auth_rpcgss 39901 2 nfsv4,nfsd
sunrpc 211604 31 nfsv4,nfsv3,nfs,nfsd,lockd,nfs_acl,auth_rpcgss
ipmi_si 48670 0
ipmi_devintf 17521 0
ipmi_msghandler 43715 2 ipmi_si,ipmi_devintf
sch_sfq 21375 2
iptable_nat 13182 1
nf_nat 24649 1 iptable_nat
nf_conntrack_ipv4 14594 3 iptable_nat,nf_nat
nf_conntrack 74130 3 iptable_nat,nf_nat,nf_conntrack_ipv4
nf_defrag_ipv4 12673 1 nf_conntrack_ipv4
iptable_filter 12810 0
xt_TCPMSS 12707 2
xt_tcpudp 12603 2
iptable_mangle 12695 1
ip_tables 26782 3 iptable_nat,iptable_filter,iptable_mangle
ip6table_filter 12815 0
ip6_tables 26942 1 ip6table_filter
x_tables 27637 8 iptable_nat,iptable_filter,xt_TCPMSS,xt_tcpudp,iptable_mangle,ip_tables,ip6table_filter,ip6_tables
ext4 482400 2
jbd2 89967 1 ext4
crc16 12559 1 ext4
raid0 17188 2
dm_mod 77931 0
autofs4 28341 11
dummy 12915 0
ide_cd_mod 35359 0
cdrom 41920 1 ide_cd_mod
ata_generic 12910 0
pata_acpi 13038 0
pata_atiixp 13271 0
ide_pci_generic 12866 0
igb 125677 0
pcspkr 12718 0
evdev 17797 0
joydev 17457 0
sp5100_tco 13697 0
powernow_k8 18109 1
freq_table 13743 1 powernow_k8
mperf 12607 1 powernow_k8
mgag200 38130 1
amd64_edac_mod 28812 0
edac_core 56455 2 amd64_edac_mod
kvm_amd 55563 0
ttm 75424 1 mgag200
kvm 406326 1 kvm_amd
drm_kms_helper 44701 1 mgag200
drm 250914 3 mgag200,ttm,drm_kms_helper
k10temp 13126 0
hwmon 12853 1 k10temp
i2c_algo_bit 13257 1 mgag200
sysimgblt 12588 1 mgag200
sysfillrect 12654 1 mgag200
syscopyarea 12445 1 mgag200
microcode 18669 0
hid_generic 12493 0
atiixp 12917 0
i2c_piix4 13266 0
dca 14601 1 igb
ptp 18413 1 igb
pps_core 13770 1 ptp
edac_mce_amd 22771 1 amd64_edac_mod
ide_core 108021 3 ide_cd_mod,ide_pci_generic,atiixp
i2c_core 28918 5 mgag200,drm_kms_helper,drm,i2c_algo_bit,i2c_piix4
processor 31231 1 powernow_k8
button 13692 0
ext3 220064 1
jbd 77423 1 ext3
mbcache 14316 2 ext4,ext3
sd_mod 44963 16
crc_t10dif 12483 1 sd_mod
raid1 35216 1
md_mod 112433 5 raid0,raid1
ahci 25731 12
libahci 26007 1 ahci
libata 195604 5 ata_generic,pata_acpi,pata_atiixp,ahci,libahci
scsi_mod 161576 2 sd_mod,libata
usbhid 46734 0
hid 94508 2 hid_generic,usbhid
ohci_hcd 31595 0
ehci_hcd 50048 0
usbcore 173967 4 usbhid,ohci_hcd,ehci_hcd
usb_common 12489 1 usbcore

Michael Wang

Nov 12, 2012, 2:41:04 AM
to Paweł Sikora, linux-...@vger.kernel.org, sta...@vger.kernel.org, torv...@linux-foundation.org, ar...@pld-linux.org, bag...@pld-linux.org
Could you please provide the whole dmesg as text? Your picture lost the
printed info about the hung task.

And please try the latest kernel; the issue may already be solved (if it is
an issue...).

Regards,
Michael Wang

Paweł Sikora

Nov 12, 2012, 5:23:03 AM
to Michael Wang, linux-...@vger.kernel.org, sta...@vger.kernel.org, torv...@linux-foundation.org, ar...@pld-linux.org, bag...@pld-linux.org
On Monday 12 of November 2012 15:40:31 Michael Wang wrote:
> Could you please provide the whole dmesg as text? Your picture lost the
> printed info about the hung task.

i've grabbed the console via rs232 but there's no more info (see attached txt).
the dmesg (filesystem) is not synced during the panic (leds on the keyboard blink, sysrq doesn't work).
how can i grab more info?
shutdown.txt

Paweł Sikora

Nov 12, 2012, 7:33:50 AM
to Michael Wang, linux-...@vger.kernel.org, sta...@vger.kernel.org, torv...@linux-foundation.org, ar...@pld-linux.org, bag...@pld-linux.org
On Monday 12 of November 2012 11:22:47 Paweł Sikora wrote:
> i've grabbed the console via rs232 but there's no more info (see attached txt).

hmm, i have one observation.

during rc.shutdown there're messages on console like this: Cannot stat file /proc/$pid/fd/1: Connection timed out
afaics this file descriptor points to vnc log file on a remote machine, e.g.:

# ps aux|grep xfwm4
eda 1748 0.0 0.0 320220 11224 ? S 13:08 0:00 xfwm4

# readlink -m /proc/1748/fd/1
/remote/dragon/ahome/eda/.vnc/odra:11.log

# mount|grep ahome
dragon:/home/users/ on /remote/dragon/ahome type nfs (rw,relatime,vers=3,rsize=262144,wsize=262144,namlen=255,soft,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.0.2.121,mountvers=3,mountport=45251,mountproto=udp,local_lock=none,addr=10.0.2.121)


so, probably during `killall5 -TERM/-KILL` in the shutdown stage something sometimes goes wrong
and these processes (xfce4/vncserver) survive the signal and hang on the nfs i/o.
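
The manual `readlink` check above can be repeated for every process at once. What follows is only a rough Python sketch of that idea, not a tool anyone in this thread used: the function name `fds_under_prefix` and its `proc_root` parameter are made up for illustration. It relies on `readlink()` only, on the assumption that reading the link text from procfs does not touch the remote file the way the failing stat did.

```python
import os

def fds_under_prefix(prefix, proc_root="/proc"):
    """Return {pid: [(fd, target), ...]} for open descriptors whose link
    target lies under `prefix` (hypothetical helper, for illustration)."""
    prefix = prefix.rstrip("/") + "/"
    hits = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process already exited, or permission denied
        for fd in fds:
            try:
                # readlink() returns the link text; the target is not stat()ed
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor was closed in the meantime
            if target.startswith(prefix):
                hits.setdefault(int(entry), []).append((int(fd), target))
    return hits
```

Run as root before shutdown, `fds_under_prefix("/remote/dragon/ahome")` would list the candidates that may later hang on the nfs i/o.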

Michael Wang

Nov 12, 2012, 9:51:23 PM
to Paweł Sikora, linux-...@vger.kernel.org, sta...@vger.kernel.org, torv...@linux-foundation.org, ar...@pld-linux.org, bag...@pld-linux.org
We can report the bug to the driver folks, but first we have to make sure
the hung thread really belongs to that driver, so we need the hung task
info (at least the name).

I'm not sure what environment we are dealing with, but some debug setup is
necessary; 'script' may help you record the dmesg even when the remote
console hangs.

If it's really hard to get the dmesg, please try the modification below and
see whether the info gets printed on the console.

Regards,
Michael Wang



diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c2e077c..2f1b718 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4504,7 +4504,7 @@ void sched_show_task(struct task_struct *p)
 	unsigned state;
 
 	state = p->state ? __ffs(p->state) + 1 : 0;
-	printk(KERN_INFO "%-15.15s %c", p->comm,
+	printk(KERN_EMERG "%-15.15s %c", p->comm,
 		state < sizeof(stat_nam) - 1 ? stat_nam[state] : '?');
 #if BITS_PER_LONG == 32
 	if (state == TASK_RUNNING)
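
The change above raises the message from KERN_INFO to KERN_EMERG, presumably so that the task line reaches the console whatever the current console loglevel is. A minimal Python sketch of that filtering rule follows. The function names are made up for illustration; only the level numbers and the meaning of the first field of /proc/sys/kernel/printk come from the kernel.

```python
# Kernel log levels, from most severe (0) to least severe (7).
LEVELS = {
    "KERN_EMERG": 0, "KERN_ALERT": 1, "KERN_CRIT": 2, "KERN_ERR": 3,
    "KERN_WARNING": 4, "KERN_NOTICE": 5, "KERN_INFO": 6, "KERN_DEBUG": 7,
}

def read_console_loglevel(path="/proc/sys/kernel/printk"):
    """The first field of /proc/sys/kernel/printk is the console loglevel."""
    with open(path) as f:
        return int(f.read().split()[0])

def reaches_console(level_name, console_loglevel):
    """A message is printed on the console only if its level is numerically
    lower (that is, more severe) than the console loglevel."""
    return LEVELS[level_name] < console_loglevel
```

With a console loglevel of 4, for example, `reaches_console("KERN_INFO", 4)` is false while `reaches_console("KERN_EMERG", 4)` is true, which would explain a task line missing from the serial capture.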

Michael Wang

Nov 13, 2012, 9:33:06 PM
to Paweł Sikora, linux-...@vger.kernel.org, sta...@vger.kernel.org, torv...@linux-foundation.org, ar...@pld-linux.org, bag...@pld-linux.org
On 11/13/2012 05:40 PM, Paweł Sikora wrote:
> ok, now i have full sysrq+w backtraces from shutdown process. i hope i'll help you.

This can only tell us which tasks are in UNINTERRUPTIBLE state, but without
timing info we can't find out which one is the hung task...

Regards,
Michael Wang
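
One way to get that missing time info from user space, while the machine still responds, is to compare two samples of /proc/<pid>/status: a task that is in D state in both samples and whose context-switch counters have not moved has not run in between. The Python below is only a sketch under that assumption, with made-up helper names. As far as I understand it is similar in spirit to the check khungtaskd performs, but it is not the kernel's code, and it looks at thread-group leaders only.

```python
import os

def parse_status(text):
    """Extract (name, state letter, total context switches) from the text
    of /proc/<pid>/status."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value.strip()
    switches = (int(fields.get("voluntary_ctxt_switches", 0)) +
                int(fields.get("nonvoluntary_ctxt_switches", 0)))
    return fields.get("Name", "?"), fields.get("State", "?")[:1], switches

def snapshot(proc_root="/proc"):
    """Map pid -> (name, state, switches) for every readable process."""
    snap = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                snap[int(entry)] = parse_status(f.read())
        except OSError:
            continue  # process exited between listdir() and open()
    return snap

def stuck_tasks(before, after):
    """Tasks in D state in both snapshots whose switch count did not move."""
    stuck = []
    for pid, (name, state, switches) in after.items():
        old = before.get(pid)
        if old and state == "D" and old[1] == "D" and old[2] == switches:
            stuck.append((pid, name))
    return sorted(stuck)
```

Taking `snapshot()` twice, some tens of seconds apart, and passing both results to `stuck_tasks()` gives the names of the tasks that stayed blocked for the whole interval.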

Robert Hancock

Nov 13, 2012, 9:49:28 PM
to Michael Wang, Paweł Sikora, linux-...@vger.kernel.org, sta...@vger.kernel.org, torv...@linux-foundation.org, ar...@pld-linux.org, bag...@pld-linux.org
Probably all of the ones in D state waiting on NFS are the issue - but
as I understand it, with modern kernels processes are supposed to be
killable while waiting on NFS I/O. Maybe there's a bug that affects
this, though?

Michael Wang

Nov 13, 2012, 10:09:18 PM
to Robert Hancock, Paweł Sikora, linux-...@vger.kernel.org, sta...@vger.kernel.org, torv...@linux-foundation.org, ar...@pld-linux.org, bag...@pld-linux.org
That sounds possible. I think Paweł can try to stop using NFS (if
possible) and take a look; if the issue disappears, then it's time to
report the bug to the NFS folks.

Regards,
Michael Wang

Paweł Sikora

Nov 14, 2012, 4:11:08 PM
to Michael Wang, linux-...@vger.kernel.org, sta...@vger.kernel.org, torv...@linux-foundation.org, ar...@pld-linux.org, bag...@pld-linux.org
attaching backtraces reported by khungtaskd during reboot sequence.
dmesg.txt

Michael Wang

Nov 14, 2012, 8:41:57 PM
to Paweł Sikora, linux-...@vger.kernel.org, sta...@vger.kernel.org, torv...@linux-foundation.org, ar...@pld-linux.org, bag...@pld-linux.org, Alexander Viro, Fengguang Wu, Johannes Weiner, ja...@suse.cz, linu...@kvack.org
So it's blocked on __lock_page() for too long?
Adding more mm experts to cc.

Regards,
Michael Wang


Jan Kara

Nov 15, 2012, 4:41:02 AM
to Michael Wang, Paweł Sikora, linux-...@vger.kernel.org, sta...@vger.kernel.org, torv...@linux-foundation.org, ar...@pld-linux.org, bag...@pld-linux.org, Alexander Viro, Fengguang Wu, Johannes Weiner, ja...@suse.cz, linu...@kvack.org
It's really NFS related. E.g. in the trace
https://lkml.org/lkml/2012/11/14/657 we are in fact waiting on the PageWriteback
bit - i.e. we have submitted data to the NFS server and are waiting for
its response that the data was written.

Honza
--
Jan Kara <ja...@suse.cz>
SUSE Labs, CR

Michael Wang

Nov 15, 2012, 9:51:01 PM
to Jan Kara, Paweł Sikora, linux-...@vger.kernel.org, sta...@vger.kernel.org, torv...@linux-foundation.org, ar...@pld-linux.org, bag...@pld-linux.org, Alexander Viro, Fengguang Wu, Johannes Weiner, linu...@kvack.org
Do you mean NFS locks some page, then waits for a response which never comes in?

Regards,
Michael Wang


Jan Kara

Nov 15, 2012, 10:11:30 PM
to Michael Wang, Jan Kara, Paweł Sikora, linux-...@vger.kernel.org, sta...@vger.kernel.org, torv...@linux-foundation.org, ar...@pld-linux.org, bag...@pld-linux.org, Alexander Viro, Fengguang Wu, Johannes Weiner, linu...@kvack.org
Well, in that particular trace we are not waiting for PageLock but on the
PageWriteback bit, as I wrote. But otherwise yes, NFS set the PageWriteback bit
and sent the data from the page to the server. When the server responds,
the PageWriteback bit will get cleared and the waiting task could continue. But
the response didn't come.

Honza
--
Jan Kara <ja...@suse.cz>
SUSE Labs, CR
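
The sequence described above (set the writeback bit, send the data, wait until the server's reply clears the bit) can be pictured with a toy model in Python. This is purely illustrative and is not kernel code: the `Page` class and `demo` function are invented here, and the timeout exists only so that the demo terminates, whereas the task in the trace waits in D state with no such way out.

```python
import threading

class Page:
    """Toy stand-in for a page-cache page (hypothetical, not kernel code)."""

    def __init__(self):
        self._no_writeback = threading.Event()
        self._no_writeback.set()       # no writeback in flight initially

    def set_writeback(self):
        self._no_writeback.clear()     # data has been sent to the server

    def end_writeback(self):
        self._no_writeback.set()       # server acknowledged the write

    def wait_on_writeback(self, timeout=None):
        """Block until writeback ends; returns False if `timeout` expires."""
        return self._no_writeback.wait(timeout)

def demo(server_replies):
    """Return True if the waiter was released, False if it was still stuck."""
    page = Page()
    page.set_writeback()
    if server_replies:
        # simulate the server's reply arriving a little later
        threading.Timer(0.05, page.end_writeback).start()
    return page.wait_on_writeback(timeout=0.5)
```

`demo(True)` models the normal case where the reply arrives and the waiter continues; `demo(False)` models the case in the trace, where nothing ever clears the bit.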