I keep getting soft lockups; the symptoms appear to be that user
processes become deadlocked when they try to access the disk, but the
kernel doesn't notice that anything is wrong.
It appears that the lockups happen when:
(a) my software (mdadm) RAID mirror is resyncing or checking
(b) some other program does heavy disk IO
I'm currently trying to work around the problem by disabling the
system's backups (which are a prime trigger in category (b)) until the
post-crash mirror resync is complete, combined with setting the sysctl
dev.raid.speed_limit_max to 1000. Disabling the backups alone was not
sufficient: it seems the various recovery tasks run after the most
recent crash are enough to trigger the bug by themselves.
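Concretely, the throttling part of the workaround looks like this (a
sketch; the values are in KB/sec per disk and setting them requires
root):

```shell
# Throttle the md resync/check so it cannot saturate the disks.
sysctl -w dev.raid.speed_limit_max=1000

# Equivalent via procfs:
echo 1000 > /proc/sys/dev/raid/speed_limit_max

# Watch the (now much slower) resync progress:
cat /proc/mdstat
```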
I observe that the message from the md driver, and the presence of the
sysctls dev.raid.speed_limit_{min,max}, suggest that the md driver is
using a fundamentally incorrect algorithm which depends on guessing
the available IO bandwidth.
Actual hard data:
* chiark locked up at 0355 on the 2nd of May. It did so shortly after
  the backups started. chiark's backups are done with LVM snapshots and
  rsync. I no longer have information about exactly what stage the
  backups had reached.
* chiark locked up again at 0355 on Sunday morning the 6th of June.
  In this case I could see that the system had successfully created
  and mounted the LVM snapshot, since the rsync process on the backup
  master system was still present.
* Note that both of these crashes happened on the first Sunday of the
  month. Investigating the kernel logs for the second crash shows this:
Jun 6 00:57:01 chiark kernel: [3250794.079478] md: data-check of RAID array md127
Jun 6 00:57:01 chiark kernel: [3250794.079479] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Jun 6 00:57:01 chiark kernel: [3250794.079479] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Jun 6 00:57:01 chiark kernel: [3250794.079479] md: using 128k window, over a total of 222297980 blocks.
However there is no sign of it completing.
* Shortly after I power cycled the system to get it back up, it locked
up again. At that point, the RAID was doing a post-crash resync,
and the crash happened at the point where the backup master
attempted to start on the next partition.
* Some research dug up some reports of first-Sunday-of-the-month
crashes associated with RAID integrity checks.[1] So I tried setting
the max IO bandwidth sysctl to the lowest reasonable value (by hand)
and left the system running. However, it locked up again as the
backups started.
* I power cycled the machine again and disabled the backups and made
sure they were truly stopped. However chiark locked up again mere
minutes after coming back up.
* I have now power cycled chiark once more and this time I set the
max IO bandwidth sysctl. So far it seems to have stayed up for
about half an hour.
If nothing more goes wrong I intend to leave the backups disabled
until the RAID finishes its resync, which we can expect to be some
time on Wednesday night.
chiark is running LVM-on-MD-on-LVM. So the two "disks" which form the
mirror which is resyncing are actually LVM LVs, and that mirror is
itself an LVM PV. The underlying hardware is a Dell R210 with two
WD SATA disks (WD7500AYPS-0).
[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/212684
seems relevant for example.
Ian.
lvs:
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
mirror-base chiark-early-a -wi-ao 400.00G
stripe-base chiark-early-a -wi-ao 52.00G
swap chiark-early-a -wi-ao 4.00G
mirror-base chiark-early-b -wi-ao 400.00G
stripe-base chiark-early-b -wi-ao 52.00G
swap chiark-early-b -wi-ao 4.00G
lv-quicksand vg-chiark-stripe -wi-ao 40.00G
twomack-relations vg-chiark-stripe -wi-ao 60.00G
lv-log vg-main -wi-ao 3.00G
lv-mail vg-main -wi-ao 3.00G
lv-news vg-main -wi-ao 15.94G
lv-root vg-main -wi-ao 1.00G
lv-u vg-main -wi-ao 9.75G
lv-u2 vg-main -wi-ao 12.25G
lv-u3 vg-main -wi-ao 7.06G
lv-usr vg-main -wi-ao 4.12G
lv-var vg-main -wi-ao 5.94G
lv-volt vg-main -wi-ao 77.44G
pvs:
PV VG Fmt Attr PSize PFree
/dev/md_chiark-mirror vg-main lvm2 a- 211.94G 72.44G
/dev/md_chiark-stripe vg-chiark-stripe lvm2 a- 103.94G 3.94G
/dev/sdb2 chiark-early-a lvm2 a- 464.00G 8.00G
/dev/sdc2 chiark-early-b lvm2 a- 464.00G 8.00G
/proc/mdstat:
Personalities : [raid0] [raid1]
md126 : active raid0 dm-2[0] dm-5[1]
109051776 blocks super 1.1 64k chunks
md127 : active raid1 dm-0[3] dm-3[2]
222297980 blocks super 1.1 [2/2] [UU]
[>....................] resync = 0.9% (2014848/222297980) finish=3659.9min speed=1000K/sec
mdadm -D -vv --scan:
/dev/md127:
Version : 01.01
Creation Time : Sat Mar 6 20:06:57 2010
Raid Level : raid1
Array Size : 222297980 (212.00 GiB 227.63 GB)
Used Dev Size : 444595960 (424.00 GiB 455.27 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 127
Persistence : Superblock is persistent
Update Time : Mon Jun 7 10:13:05 2010
State : active, resyncing
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Rebuild Status : 0% complete
Name : chiark:chiark-mirror (local to host chiark)
UUID : [***]
Events : 9033
Number Major Minor RaidDevice State
3 254 0 0 active sync /dev/dm-0
2 254 3 1 active sync /dev/dm-3
/dev/md126:
Version : 01.01
Creation Time : Thu Mar 11 13:52:05 2010
Raid Level : raid0
Array Size : 109051776 (104.00 GiB 111.67 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 126
Persistence : Superblock is persistent
Update Time : Mon Jun 7 10:47:15 2010
State : active
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Chunk Size : 64K
Name : chiark:chiark-stripe (local to host chiark)
UUID : [***]
Events : 5
Number Major Minor RaidDevice State
0 254 2 0 active sync /dev/dm-2
1 254 5 1 active sync /dev/dm-5
brw-r----- 1 root disk 254, 0 Jun 7 10:13 /dev/dm-0
brw-r----- 1 root disk 254, 1 Jun 7 10:13 /dev/dm-1
brw-r----- 1 root disk 254, 2 Jun 7 10:13 /dev/dm-2
brw-r----- 1 root disk 254, 3 Jun 7 10:13 /dev/dm-3
brw-r----- 1 root disk 254, 4 Jun 7 10:13 /dev/dm-4
brw-r----- 1 root disk 254, 5 Jun 7 10:13 /dev/dm-5
brw-r----- 1 root disk 254, 6 Jun 7 10:13 /dev/dm-6
brw-r----- 1 root disk 254, 7 Jun 7 10:13 /dev/dm-7
brw-r----- 1 root disk 254, 8 Jun 7 10:13 /dev/dm-8
/dev/mapper/:
total 0
drwxr-xr-x 2 root root 420 Jun 7 10:43 ./
drwxr-xr-x 19 root root 3840 Jun 7 10:43 ../
brw-rw---- 1 root disk 254, 0 Jun 7 10:13 chiark--early--a-mirror--base
brw-rw---- 1 root disk 254, 2 Jun 7 10:13 chiark--early--a-stripe--base
brw-rw---- 1 root disk 254, 1 Jun 7 10:13 chiark--early--a-swap
brw-rw---- 1 root disk 254, 3 Jun 7 10:13 chiark--early--b-mirror--base
brw-rw---- 1 root disk 254, 5 Jun 7 10:13 chiark--early--b-stripe--base
brw-rw---- 1 root disk 254, 4 Jun 7 10:13 chiark--early--b-swap
crw-rw---- 1 root root 10, 60 Jun 7 10:13 control
brw-rw---- 1 root disk 254, 20 Jun 7 10:13 vg--chiark--stripe-lv--quicksand
brw-rw---- 1 root disk 254, 19 Jun 7 10:13 vg--chiark--stripe-twomack--relations
brw-rw---- 1 root disk 254, 6 Jun 7 10:13 vg--main-lv--log
brw-rw---- 1 root disk 254, 7 Jun 7 10:13 vg--main-lv--mail
brw-rw---- 1 root disk 254, 8 Jun 7 10:13 vg--main-lv--news
brw-rw---- 1 root disk 254, 18 Jun 7 10:13 vg--main-lv--root
brw-rw---- 1 root disk 254, 10 Jun 7 10:13 vg--main-lv--u
brw-rw---- 1 root disk 254, 13 Jun 7 10:13 vg--main-lv--u2
brw-rw---- 1 root disk 254, 14 Jun 7 10:13 vg--main-lv--u3
brw-rw---- 1 root disk 254, 15 Jun 7 10:13 vg--main-lv--usr
brw-rw---- 1 root disk 254, 16 Jun 7 10:13 vg--main-lv--var
brw-rw---- 1 root disk 254, 17 Jun 7 10:13 vg--main-lv--volt
We really need to see the kernel messages reporting soft-lockup.
Note that your disk configuration is unusual and probably not
well-tested by others. It is unlikely that anyone in the kernel team
will be able to debug this as I don't think any of us have particular
expertise in this area. If you have the opportunity, it would be
helpful if you could test a newer kernel version. This would give us a
clue as to whether the bug has subsequently been fixed upstream, and if
not then it would be the basis for an upstream bug report.
Ben.
--
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.
There aren't any. Or, if there are, it isn't printing them to the
serial console. Perhaps it is trying to send them only to syslogd
which obviously can't write to the disk either.
I know that the serial console is working after the crash as it echoes
my first carriage return (although then login or the shell wedges
before being able to print a prompt). I could have asked for a magic
sysrq process dump but given that the process table is probably full I
expect this would take many hours at 9600bps.
> Note that your disk configuration is unusual and probably not
> well-tested by others. It is unlikely that anyone in the kernel team
> will be able to debug this as I don't think any of us have particular
> expertise in this area.
Searching the web suggests that symptoms very similar to mine are not
uncommon, including instances without soft lockup messages, and none
of the other users seem to have a similar disk layout.
I can't easily test this theory but I think the unusual disk layout is
probably simply making a race easier to trigger.
> If you have the opportunity, it would be
> helpful if you could test a newer kernel version. This would give us a
> clue as to whether the bug has subsequently been fixed upstream, and if
> not then it would be the basis for an upstream bug report.
Unfortunately testing this bug on the live system involves (a) risking
a crash and then if the test fails (b) an extremely vulnerable system
without backups for the following four days.
I'll see if I can borrow a spare R210 from Jump, in which case I may
be able to reproduce the problem in controlled conditions on my coffee
table at home (and with access to the VGA console). Which kernel
should I test in that case ?
Ian.
Even if you can't get a process dump, you can get some useful
information with:
'd' - show locks held
'l' - show backtrace for active CPUs
'w' - show uninterruptible tasks
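If typing the sysrq sequence on the serial console is awkward, the same
dumps can also be requested from a root shell via procfs, assuming
sysrq is enabled and a shell still runs at all; the output goes to the
kernel log (serial console / dmesg), not to stdout:

```shell
echo 1 > /proc/sys/kernel/sysrq   # enable all sysrq functions
echo d > /proc/sysrq-trigger      # show locks held
echo l > /proc/sysrq-trigger      # backtrace for active CPUs
echo w > /proc/sysrq-trigger      # show uninterruptible tasks
```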
> > Note that your disk configuration is unusual and probably not
> > well-tested by others. It is unlikely that anyone in the kernel team
> > will be able to debug this as I don't think any of us have particular
> > expertise in this area.
>
> Searching the web suggests that symptoms very similar to mine are not
> uncommon, including instances without soft lockup messages, and none
> of the other users seem to have a similar disk layout.
>
> I can't easily test this theory but I think the unusual disk layout is
> probably simply making a race easier to trigger.
Thinking of some kind of lock-dependency bug? Blocking on a mutex for a
long period should still trigger a soft-lockup message. Since there are
no messages from the kernel it's something of a mystery what's going on.
> > If you have the opportunity, it would be
> > helpful if you could test a newer kernel version. This would give us a
> > clue as to whether the bug has subsequently been fixed upstream, and if
> > not then it would be the basis for an upstream bug report.
>
> Unfortunately testing this bug on the live system involves (a) risking
> a crash and then if the test fails (b) an extremely vulnerable system
> without backups for the following four days.
That's what I suspected.
> I'll see if I can borrow a spare R210 from Jump, in which case I may
> be able to reproduce the problem in controlled conditions on my coffee
> table at home (and with access to the VGA console). Which kernel
> should I test in that case ?
Please try 2.6.34 from experimental.
Right, thanks.
> 'd' - show locks held
> 'l' - show backtrace for active CPUs
> 'w' - show uninterruptible tasks
I'll try these although I suspect thousands of uninterruptible tasks.
> > Searching the web suggests that symptoms very similar to mine are not
> > uncommon, including instances without soft lockup messages, and none
> > of the other users seem to have a similar disk layout.
> >
> > I can't easily test this theory but I think the unusual disk layout is
> > probably simply making a race easier to trigger.
>
> Thinking of some kind of lock-dependency bug? Blocking on a mutex for a
> long period should still trigger a soft-lockup message. Since there are
> no messages from the kernel it's something of a mystery what's going on.
The RAID system (md driver) has a separate mechanism for blocking
writes, which it calls a "barrier". I think it is quite capable of
indefinitely blocking all writes to a device without necessarily
triggering the soft lockup detector.
> > I'll see if I can borrow a spare R210 from Jump, in which case I may
> > be able to reproduce the problem in controlled conditions on my coffee
> > table at home (and with access to the VGA console). Which kernel
> > should I test in that case ?
>
> Please try 2.6.34 from experimental.
Will do. I'll get back to you.
Thanks,
I/O barriers are block I/O operations (not specific to md) that inhibit
reordering of read and write operations. They certainly should not be
blocking operations. Also, device-mapper did not support barriers until
after 2.6.26 so md will not be using them in the configuration you are
using.
No, I think there are two meanings of the word "barrier". AFAICT md
has its own thing which it confusingly calls a "barrier"; it can be
"raised" and "lowered".
Oh, great! I wondered whether this was the case but I could only find
discussion of md vs I/O barriers. Do you have any reference for
documentation of md barriers?
No. I just stumbled across them in the source. Particularly this in
drivers/md/raid1.c:
/* Barriers....
* Sometimes we need to suspend IO while we do something else,
* either some resync/recovery, or reconfigure the array.
* To do this we raise a 'barrier'.
* The 'barrier' is a counter that can be raised multiple times
* to count how many activities are happening which preclude
* normal IO.
* We can only raise the barrier if there is no pending IO.
* i.e. if nr_pending == 0.
* We choose only to raise the barrier if no-one is waiting for the
* barrier to go down. This means that as soon as an IO request
* is ready, no other operations which require a barrier will start
* until the IO request has had a chance.
*
* So: regular IO calls 'wait_barrier'. When that returns there
* is no backgroup IO happening, It must arrange to call
* allow_barrier when it has finished its IO.
* backgroup IO calls must call raise_barrier. Once that returns
* there is no normal IO happeing. It must arrange to call
* lower_barrier when the particular background IO completes.
*/
But I do see a deadlock on the barrier between resync and normal I/O, as
you suspected. Every process is blocked in wait_barrier(),
raise_barrier() or a filesystem lock.
The only explanations I can think of are:
1. The regular I/O requests were miscounted so that the resync process
will wait forever for them to complete in raise_barrier(), while the
other processes wait forever for the barrier to be lowered. (It seems
like the directions of this barrier are inverted. Maybe it is really a
bollard.)
2. The wait condition of one of the waiters was met, but it wasn't
woken.
3. A thread which raised the barrier failed to lower the barrier.
4. A thread which raised the barrier caused paging.
Unfortunately I can't see any evidence for any of these.
The following patch should give us some chance of detecting cases 1 and
3. The instructions at
<http://kernel-handbook.alioth.debian.org/ch-common-tasks.html#s-common-official> explain how to rebuild an official kernel package with a patch applied.
Ben.
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -657,6 +657,7 @@
/* block any new IO from starting */
conf->barrier++;
+ WARN_ON(conf->barrier != 1);
/* Now wait for all pending IO to complete */
wait_event_lock_irq(conf->wait_barrier,
@@ -671,6 +672,7 @@
{
unsigned long flags;
spin_lock_irqsave(&conf->resync_lock, flags);
+ WARN_ON(conf->barrier == 0);
conf->barrier--;
spin_unlock_irqrestore(&conf->resync_lock, flags);
wake_up(&conf->wait_barrier);
@@ -694,6 +696,7 @@
{
unsigned long flags;
spin_lock_irqsave(&conf->resync_lock, flags);
+ WARN_ON(conf->nr_pending == 0);
conf->nr_pending--;
spin_unlock_irqrestore(&conf->resync_lock, flags);
wake_up(&conf->wait_barrier);
--- END ----
My attempt to compile the kernel with kgdb support failed. Something
in the Debian kernel packaging thingy complained like this:
ABI has changed! Refusing to continue.
Added symbols:
kgdb_breakpoint module: vmlinux,
version: 0x01e1a8de, export: EXPORT_SYMBOL_GPL
...
Changed symbols:
__audit_inode_child module: vmlinux, version: 0xead6af60 -> 0xe3835d2d, export: EXPORT_SYMBOL_GPL
__fatal_signal_pending module: vmlinux, version: 0xb5428735 -> 0xee8d0603, export: EXPORT_SYMBOL
[lots more]
I've run out of time to do more on this now.
If you remove the 'debian/abi' directory this check will be skipped.