This is seen on Supermicro H8DM3-2 and H8DMi-2 motherboards with 16 GB RAM
and Opteron 2222 SE or 2216 processors.
It is not seen on a Uniwide 3546ES, or on a Supermicro H8DM8 with 8222 CPUs.
kswapd sits at (according to KDB)
__spinlock_irqsave + 0x15
prepare_to_wait + 0x15
kswapd + 0xe3
kthread + 0x47
berkley
This is the pgdat->kswapd_wait waitqueue lock, by the looks of it?
It should be basically impossible to get this lock wrong, so
perhaps what happens is some memory corruption which gets papered
over when spinlock debugging pads the structure (or makes some
other random layout change)?
Is this very reproducible? Does your dmesg show any odd
messages after performing this heavy write load with
CONFIG_DEBUG_SPINLOCK set?
No errors, nothing in dmesg, nothing with CONFIG_DEBUG_SPINLOCK set.
Sysrq shows a lot (when it works), but by then too many things are
locked up for it to be of much use. I can get into KDB and look around
(2.6.22 with KDB; it hangs there too). Even access to the local disk is
blocked.
Processes already in core and running stay there (iostat, top, ...).
I personally suspect the BIOS on the PCIe side, as the symptoms change
with the BIOS revision. I did a major paper on SAS performance with one
H8DMi, but it got a BIOS update and now crashes. A missed interrupt? The
APIC sending an interrupt to multiple CPUs? I don't know.
Tell me what to look at, and I can get you the info. It usually takes 20
seconds to go bang, using either the LSI 8888ELP or the RocketRAID 2340;
other controllers are too slow. 2.6.20 does not lock up. It is also
200 MB/s slower at writing :-)
Thanks for the response.
berkley
OK, it does sound suspiciously like a hardware bug, or some
unrelated software bug that is causing memory scribbles...
A few things you could do.
One is to verify that it really is the kswapd_wait spinlock it is
spinning on, and then, when you see the lockup, to verify that no
other task is holding the lock. (It is quite an inner lock, so you
shouldn't have to wade through call chains...) That would confirm
corruption. Dumping the lock contents and the fields in the
structure around the lock might give a clue.
You could move the spinlock somewhere else and see what happens
(shift it around in the structure, or get even more creative...),
or do something like keeping two spinlocks and, when you hit the
lockup, verifying whether or not they agree.
(It sounds like you're pretty capable, but if you want me to have
a look at doing a patch or two to help, let me know.)
Another is to bisect the problem. However, as you say, the older
kernels are slower, so you may just bisect to the point where the
kernel sustains enough load to trigger the bug, so this may not be
worth your time just yet.
You could _try_ turning on slab debugging. If there is random
corruption, it might get caught. Then again, it might just change
things enough to hide the problem.
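For reference, the slab-debugging knobs of that era look roughly like
this (option names from the 2.6.2x Kconfig; worth double-checking
against your tree):

```
# .config — pick the option matching your allocator:
CONFIG_DEBUG_SLAB=y     # classic SLAB allocator
CONFIG_SLUB_DEBUG=y     # SLUB (new in 2.6.22)

# SLUB debugging is then switched on at boot, e.g.:
#   slub_debug=FZP      # F=sanity checks, Z=red-zoning, P=poisoning
```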
Thanks for reporting!
[ 626.028180] eth0: too many iterations (6) in nv_nic_irq.
[ 626.167583] eth0: too many iterations (6) in nv_nic_irq.
[ 626.206729] eth0: too many iterations (6) in nv_nic_irq.
[ 626.400171] eth0: too many iterations (6) in nv_nic_irq.
[ 626.446374] eth0: too many iterations (6) in nv_nic_irq.
followed by the crash at
forcedeth:nv_nic_irq_optimized+0x89
handle_IRQ_event+0x25
handle_edge_IRQ+0xe3
do_IRQ
berkley
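(Aside: the "too many iterations" message is the forcedeth interrupt
handler bailing out after its per-interrupt work limit, which is a
module parameter. If it helps to rule that out, something like the
following should raise it; the parameter name is taken from the
forcedeth source of that era, so confirm with `modinfo` first:)

```
modinfo forcedeth | grep max_interrupt_work   # confirm the parameter exists
rmmod forcedeth
modprobe forcedeth max_interrupt_work=20
```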
OK, thanks. We should probably try to fix this problem before
chasing the more elusive-looking corruption.
CC'ed forcedeth maintainer and netdev.
Can you reiterate what your hardware is (including lspci)?