[PATCH 000 of 9] md: Introduction EXPLAIN PATCH SET HERE

NeilBrown

Mar 2, 2008, 7:20:08 PM

Following are 9 patches for md in 2.6.25-rc2-mm1. All are fairly
simple bugfixes and are suitable for 2.6.25 (though I confess I
should have sent some of them earlier :-( ).

They include fixes for three bugs found by the scsi fault injection framework
(thanks!), a small raid10 read optimisation, and various other bits
and pieces.

NeilBrown

[PATCH 001 of 9] md: Fix deadlock in md/raid1 and md/raid10 when handling a read error.
[PATCH 002 of 9] md: Reduce CPU wastage on idle md array with a write-intent bitmap.
[PATCH 003 of 9] md: Guard against possible bad array geometry in v1 metadata.
[PATCH 004 of 9] md: Clean up irregularity with raid autodetect.
[PATCH 005 of 9] md: Make sure a reshape is started when device switches to read-write.
[PATCH 006 of 9] md: Lock access to rdev attributes properly
[PATCH 007 of 9] md: Don't attempt read-balancing for raid10 'far' layouts.
[PATCH 008 of 9] md: Fix possible raid1/raid10 deadlock on read error during resync.
[PATCH 009 of 9] md: The md RAID10 resync thread could cause a md RAID10 array deadlock
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

NeilBrown

Mar 2, 2008, 7:30:11 PM

When we access attributes of an rdev (component device on an md array)
through sysfs, we really need to lock the array against concurrent
changes. We currently do that when we change an attribute, but not
when we read an attribute. We need to lock when reading as well,
otherwise rdev->mddev could become NULL while we are accessing it.

So add appropriate locking (mddev_lock) to rdev_attr_show.

rdev_size_store requires some extra care as well as it needs to unlock
the mddev while scanning other mddevs for overlapping regions. We
currently assume that rdev->mddev will still be unchanged after the
scan, but that cannot be certain. So take a copy of rdev->mddev for
use at the end of the function.
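
The read-side pattern described above can be sketched in plain userspace C. This is only an illustration: `toy_array`, `toy_rdev`, `toy_lock`, `toy_unlock` and `toy_attr_show` are hypothetical stand-ins, not the kernel's mddev_t, mdk_rdev_t or mddev_lock(). The point it shows is the order of operations: snapshot the array pointer once, lock through the snapshot, re-check rdev's pointer under the lock, and unlock through the snapshot.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the array and component-device structures. */
struct toy_array { int locked; unsigned long long size; };
struct toy_rdev  { struct toy_array *array; unsigned long long size; };

/* Toy lock: fails with -EINTR if already held (assumption for the sketch). */
int toy_lock(struct toy_array *a)
{
    if (a->locked)
        return -EINTR;
    a->locked = 1;
    return 0;
}

void toy_unlock(struct toy_array *a)
{
    assert(a->locked);
    a->locked = 0;
}

/* Read one attribute (here: the device size) under the array lock. */
long toy_attr_show(struct toy_rdev *rdev, unsigned long long *out)
{
    struct toy_array *array = rdev->array;  /* snapshot the pointer once */
    long rv;

    rv = array ? toy_lock(array) : -EBUSY;
    if (rv == 0) {
        if (rdev->array == NULL)            /* detached before we got the lock */
            rv = -EBUSY;
        else
            *out = rdev->size;
        toy_unlock(array);                  /* unlock via the snapshot */
    }
    return rv;
}
```

Unlocking through the snapshot rather than through rdev's own pointer is the same reason rdev_size_store keeps a private copy: the pointer held in rdev may have changed by the time the unlock happens.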

Signed-off-by: Neil Brown <ne...@suse.de>

### Diffstat output
./drivers/md/md.c | 35 ++++++++++++++++++++++++++---------
1 file changed, 26 insertions(+), 9 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c 2008-02-22 15:46:52.000000000 +1100
+++ ./drivers/md/md.c 2008-02-22 15:56:20.000000000 +1100
@@ -2001,9 +2001,11 @@ rdev_size_store(mdk_rdev_t *rdev, const
char *e;
unsigned long long size = simple_strtoull(buf, &e, 10);
unsigned long long oldsize = rdev->size;
+ mddev_t *my_mddev = rdev->mddev;
+
if (e==buf || (*e && *e != '\n'))
return -EINVAL;
- if (rdev->mddev->pers)
+ if (my_mddev->pers)
return -EBUSY;
rdev->size = size;
if (size > oldsize && rdev->mddev->external) {
@@ -2016,7 +2018,7 @@ rdev_size_store(mdk_rdev_t *rdev, const
int overlap = 0;
struct list_head *tmp, *tmp2;

- mddev_unlock(rdev->mddev);
+ mddev_unlock(my_mddev);
for_each_mddev(mddev, tmp) {
mdk_rdev_t *rdev2;

@@ -2036,7 +2038,7 @@ rdev_size_store(mdk_rdev_t *rdev, const
break;
}
}
- mddev_lock(rdev->mddev);
+ mddev_lock(my_mddev);
if (overlap) {
/* Someone else could have slipped in a size
* change here, but doing so is just silly.
@@ -2048,8 +2050,8 @@ rdev_size_store(mdk_rdev_t *rdev, const
return -EBUSY;
}
}
- if (size < rdev->mddev->size || rdev->mddev->size == 0)
- rdev->mddev->size = size;
+ if (size < my_mddev->size || my_mddev->size == 0)
+ my_mddev->size = size;
return len;
}

@@ -2070,10 +2072,21 @@ rdev_attr_show(struct kobject *kobj, str
{
struct rdev_sysfs_entry *entry = container_of(attr, struct rdev_sysfs_entry, attr);
mdk_rdev_t *rdev = container_of(kobj, mdk_rdev_t, kobj);
+ mddev_t *mddev = rdev->mddev;
+ ssize_t rv;

if (!entry->show)
return -EIO;
- return entry->show(rdev, page);
+
+ rv = mddev ? mddev_lock(mddev) : -EBUSY;
+ if (!rv) {
+ if (rdev->mddev == NULL)
+ rv = -EBUSY;
+ else
+ rv = entry->show(rdev, page);
+ mddev_unlock(mddev);
+ }
+ return rv;
}

static ssize_t
@@ -2082,15 +2095,19 @@ rdev_attr_store(struct kobject *kobj, st
{
struct rdev_sysfs_entry *entry = container_of(attr, struct rdev_sysfs_entry, attr);
mdk_rdev_t *rdev = container_of(kobj, mdk_rdev_t, kobj);
- int rv;
+ ssize_t rv;
+ mddev_t *mddev = rdev->mddev;

if (!entry->store)
return -EIO;
if (!capable(CAP_SYS_ADMIN))
return -EACCES;
- rv = mddev_lock(rdev->mddev);
+ rv = mddev ? mddev_lock(mddev): -EBUSY;
if (!rv) {
- rv = entry->store(rdev, page, length);
+ if (rdev->mddev == NULL)
+ rv = -EBUSY;
+ else
+ rv = entry->store(rdev, page, length);
mddev_unlock(rdev->mddev);
}
return rv;

NeilBrown

Mar 2, 2008, 7:30:13 PM

From: "K.Tanaka" <k-ta...@ce.jp.nec.com>

This message describes another md RAID10 issue, found by testing
2.6.24 md RAID10 with the new scsi fault injection framework.

Abstract:
When a scsi error results in disabling a disk during RAID10 recovery,
the resync thread of md RAID10 could stall.
In this case the raid array is already broken, so it may not matter,
but I think a stall is not desirable. If it occurs, even shutdown or reboot
will fail because resources remain busy.

The deadlock mechanism:
The r10bio_s structure has a "remaining" member to keep track of BIOs yet to be
handled when recovering. The "remaining" counter is incremented when building a BIO
in sync_request() and is decremented when a BIO finishes in end_sync_write().

If building a BIO fails for some reason in sync_request(), "remaining" should be
decremented if it has already been incremented. I found a case where this decrement
is forgotten. This causes an md_do_sync() deadlock because md_do_sync() waits for
md_done_sync() to be called by end_sync_write(), but end_sync_write() never calls
md_done_sync() because of the "remaining" counter mismatch.
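
The counter mismatch can be illustrated with a small userspace sketch. The names `toy_req`, `toy_issue` and `toy_end` are hypothetical and only model the accounting; they are not the raid10 code. A request completes when its counter drops to zero, so an increment that is not undone on the failure path leaves the request outstanding forever.

```c
#include <assert.h>

/* Hypothetical model of a parent request with a "remaining" counter. */
struct toy_req { int remaining; int completed; };

/* Completion side: the request is done when the last sub-request ends. */
void toy_end(struct toy_req *r)
{
    assert(r->remaining > 0);
    if (--r->remaining == 0)
        r->completed = 1;
}

/*
 * Issue side: account for a sub-request, then try to build it.
 * If building fails and apply_fix is set, undo the increment,
 * which is what the patch adds. Returns 0 on success, -1 on failure.
 */
int toy_issue(struct toy_req *r, int build_ok, int apply_fix)
{
    r->remaining++;
    if (!build_ok) {
        if (apply_fix)
            r->remaining--;     /* drop the reference we just took */
        return -1;
    }
    return 0;
}
```

With one successful and one failed sub-request, a single toy_end() completes the request only when the failure path dropped its reference; without that, the counter stays at 1 and whoever waits on completion waits forever.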

For example, the problem can be reproduced in the following case:

Personalities : [raid10]
md0 : active raid10 sdf1[4] sde1[5](F) sdd1[2] sdc1[1] sdb1[6](F)
3919616 blocks 64K chunks 2 near-copies [4/2] [_UU_]
[>....................] recovery = 2.2% (45376/1959808) finish=0.7min speed=45376K/sec

In this case, sdf1 is recovering, and sdb1 and sde1 are disabled.
An additional error that detaches sdd will cause a deadlock.

md0 : active raid10 sdf1[4] sde1[5](F) sdd1[6](F) sdc1[1] sdb1[7](F)
3919616 blocks 64K chunks 2 near-copies [4/1] [_U__]
[=>...................] recovery = 5.0% (99520/1959808) finish=5.9min speed=5237K/sec

2739 ? S< 0:17 [md0_raid10]
28608 ? D< 0:00 [md0_resync]
28629 pts/1 Ss 0:00 bash
28830 pts/1 R+ 0:00 ps ax
31819 ? D< 0:00 [kjournald]

The resync thread appears to keep working, but it is actually deadlocked.

Patch:
With this patch, the "remaining" counter is decremented when needed.

Signed-off-by: Neil Brown <ne...@suse.de>

### Diffstat output
./drivers/md/raid10.c | 2 ++
1 file changed, 2 insertions(+)

diff .prev/drivers/md/raid10.c ./drivers/md/raid10.c
--- .prev/drivers/md/raid10.c 2008-03-03 09:56:53.000000000 +1100
+++ ./drivers/md/raid10.c 2008-03-03 11:08:28.000000000 +1100
@@ -1818,6 +1818,8 @@ static sector_t sync_request(mddev_t *md
if (j == conf->copies) {
/* Cannot recover, so abort the recovery */
put_buf(r10_bio);
+ if (rb2)
+ atomic_dec(&rb2->remaining);
r10_bio = rb2;
if (!test_and_set_bit(MD_RECOVERY_ERR, &mddev->recovery))
printk(KERN_INFO "raid10: %s: insufficient working devices for recovery.\n",

NeilBrown

Mar 2, 2008, 7:30:15 PM

Make sure the data doesn't start before the end of the superblock
when the superblock is at the start of the device.
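
A minimal userspace sketch of the check follows. It assumes, for illustration only, that offsets are in 512-byte sectors and that the superblock size is in bytes; `toy_sb_size` and `toy_geometry_ok` are hypothetical helpers that mirror the arithmetic in the patch rather than the kernel functions themselves.

```c
#include <assert.h>
#include <stdint.h>

/* Superblock size in bytes, rounded up to the hardware sector size. */
uint32_t toy_sb_size(uint32_t max_dev, uint32_t hardsect_size)
{
    uint32_t size = max_dev * 2 + 256;
    uint32_t bmask = hardsect_size - 1;

    /* The rounding trick only works for power-of-two sector sizes. */
    assert(hardsect_size && (hardsect_size & bmask) == 0);
    if (size & bmask)
        size = (size | bmask) + 1;
    return size;
}

/*
 * Returns 1 if the geometry is acceptable, 0 if the data would start
 * inside the superblock. A minor_version of 0 is taken to mean the
 * superblock is not at the start of the device, so no check is made.
 */
int toy_geometry_ok(int minor_version, uint64_t sb_offset,
                    uint32_t sb_size, uint64_t data_offset)
{
    if (minor_version && data_offset < sb_offset + sb_size / 512)
        return 0;
    return 1;
}
```

For example, a 1024-byte superblock at sector 0 occupies sectors 0 and 1, so a data offset of 1 is rejected while a data offset of 2 is accepted.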

Signed-off-by: Neil Brown <ne...@suse.de>

### Diffstat output
./drivers/md/md.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c 2008-02-22 15:46:10.000000000 +1100
+++ ./drivers/md/md.c 2008-02-22 15:46:10.000000000 +1100
@@ -1105,7 +1105,11 @@ static int super_1_load(mdk_rdev_t *rdev
rdev->sb_size = le32_to_cpu(sb->max_dev) * 2 + 256;
bmask = queue_hardsect_size(rdev->bdev->bd_disk->queue)-1;
if (rdev->sb_size & bmask)
- rdev-> sb_size = (rdev->sb_size | bmask)+1;
+ rdev->sb_size = (rdev->sb_size | bmask) + 1;
+
+ if (minor_version
+ && rdev->data_offset < sb_offset + (rdev->sb_size/512))
+ return -EINVAL;

if (sb->level == cpu_to_le32(LEVEL_MULTIPATH))
rdev->desc_nr = -1;
@@ -1137,7 +1141,7 @@ static int super_1_load(mdk_rdev_t *rdev
else
ret = 0;
}
- if (minor_version)
+ if (minor_version)
rdev->size = ((rdev->bdev->bd_inode->i_size>>9) - le64_to_cpu(sb->data_offset)) / 2;
else
rdev->size = rdev->sb_offset;

NeilBrown

Mar 2, 2008, 7:30:15 PM

When a raid1 array is stopped, all components currently get added
to the list for auto-detection. However we should really only add
components that were found by autodetection in the first place.
So add a flag to record that information, and use it.
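
The idea can be sketched with a per-device flag bit. Everything here is a hypothetical userspace model (`toy_dev`, `toy_queue`, `toy_export`); only the bit number 7 is taken from the patch.

```c
#include <assert.h>

#define TOY_AUTODETECTED 7      /* bit number, as in the patch */
#define TOY_QUEUE_MAX    16

struct toy_dev   { unsigned long flags; int dev_id; };
struct toy_queue { int ids[TOY_QUEUE_MAX]; int count; };

/* Record that this device was found by autodetection. */
void toy_mark_autodetected(struct toy_dev *d)
{
    d->flags |= 1UL << TOY_AUTODETECTED;
}

/* On export, re-queue only devices that autodetection found originally. */
void toy_export(const struct toy_dev *d, struct toy_queue *q)
{
    assert(q->count <= TOY_QUEUE_MAX);
    if ((d->flags & (1UL << TOY_AUTODETECTED)) && q->count < TOY_QUEUE_MAX)
        q->ids[q->count++] = d->dev_id;
}
```

A device added by hand never has the bit set, so stopping its array leaves the autodetect queue untouched.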

Signed-off-by: Neil Brown <ne...@suse.de>

### Diffstat output
./drivers/md/md.c | 4 +++-
./include/linux/raid/md_k.h | 1 +
2 files changed, 4 insertions(+), 1 deletion(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c 2008-02-22 15:46:10.000000000 +1100
+++ ./drivers/md/md.c 2008-02-22 15:46:25.000000000 +1100
@@ -1503,7 +1503,8 @@ static void export_rdev(mdk_rdev_t * rde
free_disk_sb(rdev);
list_del_init(&rdev->same_set);
#ifndef MODULE
- md_autodetect_dev(rdev->bdev->bd_dev);
+ if (test_bit(AutoDetected, &rdev->flags))
+ md_autodetect_dev(rdev->bdev->bd_dev);
#endif
unlock_rdev(rdev);
kobject_put(&rdev->kobj);
@@ -6025,6 +6026,7 @@ static void autostart_arrays(int part)
MD_BUG();
continue;
}
+ set_bit(AutoDetected, &rdev->flags);
list_add(&rdev->same_set, &pending_raid_disks);
i_passed++;
}

diff .prev/include/linux/raid/md_k.h ./include/linux/raid/md_k.h
--- .prev/include/linux/raid/md_k.h 2008-02-22 15:46:25.000000000 +1100
+++ ./include/linux/raid/md_k.h 2008-02-22 15:46:25.000000000 +1100
@@ -83,6 +83,7 @@ struct mdk_rdev_s
#define BarriersNotsupp 5 /* BIO_RW_BARRIER is not supported */
#define AllReserved 6 /* If whole device is reserved for
* one array */
+#define AutoDetected 7 /* added by auto-detect */

int desc_nr; /* descriptor index in the superblock */
int raid_disk; /* role of device in array */

NeilBrown

Mar 2, 2008, 7:30:17 PM

From: Keld Simonsen <ke...@dkuug.dk>

This patch changes the disk to be read for layout "far > 1" to always be
the disk with the lowest block address.

Thus the chunks to be read will always be (for a fully functioning array)
from the first band of stripes, and the raid will then work as a raid0
consisting of the first band of stripes.

Some advantages:

The fastest part of the disks involved, the outer sectors, will be used.
The outer blocks of a disk may be as much as 100% faster than the inner blocks.

Average seek time will be smaller, as seeks will always be confined to the
first part of the disks.

Mixed disks with different performance characteristics will work better.
As they will work as raid0, the sequential read rate will be the number of
disks involved times the IO rate of the slowest disk.

If a disk is malfunctioning, the first working disk that has the lowest
block address for the logical block will be used.
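
The selection rule can be sketched as a standalone function. This is a simplified model, not read_balance(): `toy_copy` and `toy_pick_copy` are hypothetical, and the real function has further shortcuts that are left out here. For far layouts the "distance" of a copy is simply its block address; otherwise it is the distance from the current head position.

```c
#include <assert.h>
#include <stdint.h>

/* One copy of the requested block on one device (hypothetical model). */
struct toy_copy {
    uint64_t addr;           /* block address of this copy on its device */
    uint64_t head_position;  /* last known head position of that device */
    int working;             /* non-zero if the device is usable */
};

/* Returns the index of the copy to read from, or -1 if none is usable. */
int toy_pick_copy(const struct toy_copy *c, int ncopies, int far_copies)
{
    int best = -1;
    uint64_t best_dist = 0;

    assert(ncopies > 0);
    for (int i = 0; i < ncopies; i++) {
        uint64_t dist;

        if (!c[i].working)
            continue;
        if (far_copies > 1)
            dist = c[i].addr;  /* far layout: prefer the lowest address */
        else
            dist = c[i].addr > c[i].head_position
                 ? c[i].addr - c[i].head_position
                 : c[i].head_position - c[i].addr;
        if (best < 0 || dist < best_dist) {
            best = i;
            best_dist = dist;
        }
    }
    return best;
}
```

With a failed device, the loop skips its copy and falls back to the working copy with the next lowest address, which matches the behaviour described for a malfunctioning disk.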

Signed-off-by: Keld Simonsen <ke...@dkuug.dk>
Signed-off-by: Neil Brown <ne...@suse.de>

### Diffstat output
./drivers/md/raid10.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)

diff .prev/drivers/md/raid10.c ./drivers/md/raid10.c
--- .prev/drivers/md/raid10.c 2008-03-03 09:35:29.000000000 +1100
+++ ./drivers/md/raid10.c 2008-03-03 09:35:52.000000000 +1100
@@ -537,7 +537,8 @@ static int read_balance(conf_t *conf, r1
current_distance = abs(r10_bio->devs[slot].addr -
conf->mirrors[disk].head_position);

- /* Find the disk whose head is closest */
+ /* Find the disk whose head is closest,
+ * or - for far > 1 - find the closest to partition beginning */

for (nslot = slot; nslot < conf->copies; nslot++) {
int ndisk = r10_bio->devs[nslot].devnum;
@@ -557,8 +558,13 @@ static int read_balance(conf_t *conf, r1
slot = nslot;
break;
}
- new_distance = abs(r10_bio->devs[nslot].addr -
- conf->mirrors[ndisk].head_position);
+
+ /* for far > 1 always use the lowest address */
+ if (conf->far_copies > 1)
+ new_distance = r10_bio->devs[nslot].addr;
+ else
+ new_distance = abs(r10_bio->devs[nslot].addr -
+ conf->mirrors[ndisk].head_position);
if (new_distance < current_distance) {
current_distance = new_distance;
disk = ndisk;

NeilBrown

Mar 2, 2008, 7:30:17 PM

Thanks to K.Tanaka and the scsi fault injection framework, here is
a fix for another possible deadlock in raid1/raid10 error handling.

If a read request returns an error while a resync is happening and
a resync request is pending, the attempt to fix the error will block
until the resync progresses, and the resync will block until the
read request completes. Thus a deadlock.

This patch fixes the problem.
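
The difference between the old and new wait conditions can be seen by evaluating both on plain counters. The sketch below is a hypothetical model (`toy_conf`, `toy_old_quiet`, `toy_new_quiet`); it assumes that a pending resync request has already raised the barrier once before freeze_array() raises it again, which is my reading of the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical snapshot of the counters after freeze_array() has
 * incremented the barrier. */
struct toy_conf { int barrier; int nr_pending; int nr_queued; };

/* Old condition: silently assumes barrier == 1 at this point. */
int toy_old_quiet(const struct toy_conf *c)
{
    assert(c != NULL);
    return c->barrier + c->nr_pending == c->nr_queued + 2;
}

/* New condition: independent of how many times the barrier was raised. */
int toy_new_quiet(const struct toy_conf *c)
{
    assert(c != NULL);
    return c->nr_pending == c->nr_queued + 1;
}
```

With no resync pending (barrier 1, one pending request, none queued) both conditions hold. With a resync request pending as well (barrier 2), the old condition can never become true, because the resync is itself waiting for the failed read to finish, while the new condition is satisfied at once.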

Cc: "K.Tanaka" <k-ta...@ce.jp.nec.com>
Signed-off-by: Neil Brown <ne...@suse.de>

### Diffstat output
./drivers/md/raid1.c | 11 +++++++++--
./drivers/md/raid10.c | 11 +++++++++--
2 files changed, 18 insertions(+), 4 deletions(-)

diff .prev/drivers/md/raid10.c ./drivers/md/raid10.c
--- .prev/drivers/md/raid10.c 2008-03-03 11:03:39.000000000 +1100
+++ ./drivers/md/raid10.c 2008-03-03 09:56:53.000000000 +1100
@@ -747,13 +747,20 @@ static void freeze_array(conf_t *conf)
/* stop syncio and normal IO and wait for everything to
* go quiet.
* We increment barrier and nr_waiting, and then
- * wait until barrier+nr_pending match nr_queued+2
+ * wait until nr_pending match nr_queued+1
+ * This is called in the context of one normal IO request
+ * that has failed. Thus any sync request that might be pending
+ * will be blocked by nr_pending, and we need to wait for
+ * pending IO requests to complete or be queued for re-try.
+ * Thus the number queued (nr_queued) plus this request (1)
+ * must match the number of pending IOs (nr_pending) before
+ * we continue.
*/
spin_lock_irq(&conf->resync_lock);
conf->barrier++;
conf->nr_waiting++;
wait_event_lock_irq(conf->wait_barrier,
- conf->barrier+conf->nr_pending == conf->nr_queued+2,
+ conf->nr_pending == conf->nr_queued+1,
conf->resync_lock,
({ flush_pending_writes(conf);
raid10_unplug(conf->mddev->queue); }));

diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
--- .prev/drivers/md/raid1.c 2008-03-03 11:03:39.000000000 +1100
+++ ./drivers/md/raid1.c 2008-03-03 09:56:52.000000000 +1100
@@ -704,13 +704,20 @@ static void freeze_array(conf_t *conf)
/* stop syncio and normal IO and wait for everything to
* go quite.
* We increment barrier and nr_waiting, and then
- * wait until barrier+nr_pending match nr_queued+2
+ * wait until nr_pending match nr_queued+1
+ * This is called in the context of one normal IO request
+ * that has failed. Thus any sync request that might be pending
+ * will be blocked by nr_pending, and we need to wait for
+ * pending IO requests to complete or be queued for re-try.
+ * Thus the number queued (nr_queued) plus this request (1)
+ * must match the number of pending IOs (nr_pending) before
+ * we continue.
*/
spin_lock_irq(&conf->resync_lock);
conf->barrier++;
conf->nr_waiting++;
wait_event_lock_irq(conf->wait_barrier,
- conf->barrier+conf->nr_pending == conf->nr_queued+2,
+ conf->nr_pending == conf->nr_queued+1,
conf->resync_lock,
({ flush_pending_writes(conf);
raid1_unplug(conf->mddev->queue); }));

NeilBrown

Mar 2, 2008, 7:30:18 PM

On an md array with a write-intent bitmap, a thread wakes up every few
seconds and scans the bitmap looking for work to do. If the
array is idle, there will be no work to do, but a lot of scanning is
done to discover this.

So cache the fact that the bitmap is completely clean, and avoid
scanning the whole bitmap when it is known to be clean.
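
A userspace sketch of the caching idea follows. It is a simplified model with hypothetical names (`toy_bitmap`, `toy_mark_dirty`, `toy_daemon_work`); it keeps only the per-chunk counter aging (2 becomes 1, 1 becomes 0) and the `allclean` flag, and adds a `scans` field purely so the number of full scans can be observed.

```c
#include <assert.h>
#include <stddef.h>

struct toy_bitmap {
    unsigned char *counters;  /* one small counter per chunk */
    size_t chunks;
    int allclean;             /* cached result: last scan found nothing */
    size_t scans;             /* full scans performed (for the sketch) */
};

/* Anything that dirties the bitmap must invalidate the cached state. */
void toy_mark_dirty(struct toy_bitmap *b, size_t chunk)
{
    assert(chunk < b->chunks);
    b->counters[chunk] = 2;
    b->allclean = 0;
}

/* Periodic work: skip the scan entirely while the bitmap is known clean. */
void toy_daemon_work(struct toy_bitmap *b)
{
    if (b->allclean)
        return;
    b->allclean = 1;          /* assume clean until the scan proves otherwise */
    b->scans++;
    for (size_t j = 0; j < b->chunks; j++) {
        if (b->counters[j])
            b->allclean = 0;  /* still work to do on a later pass */
        if (b->counters[j] == 2)
            b->counters[j] = 1;
        else if (b->counters[j] == 1)
            b->counters[j] = 0;
    }
}
```

After a single write, the daemon needs three passes before a scan finds every counter at zero; every later wake-up returns immediately until something marks the bitmap dirty again.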

Signed-off-by: Neil Brown <ne...@suse.de>

### Diffstat output
./drivers/md/bitmap.c | 19 +++++++++++++++++--
./include/linux/raid/bitmap.h | 2 ++
2 files changed, 19 insertions(+), 2 deletions(-)

diff .prev/drivers/md/bitmap.c ./drivers/md/bitmap.c
--- .prev/drivers/md/bitmap.c 2008-02-22 15:45:56.000000000 +1100
+++ ./drivers/md/bitmap.c 2008-02-22 15:45:56.000000000 +1100
@@ -1047,6 +1047,11 @@ void bitmap_daemon_work(struct bitmap *b
if (time_before(jiffies, bitmap->daemon_lastrun + bitmap->daemon_sleep*HZ))
return;
bitmap->daemon_lastrun = jiffies;
+ if (bitmap->allclean) {
+ bitmap->mddev->thread->timeout = MAX_SCHEDULE_TIMEOUT;
+ return;
+ }
+ bitmap->allclean = 1;

for (j = 0; j < bitmap->chunks; j++) {
bitmap_counter_t *bmc;
@@ -1068,8 +1073,10 @@ void bitmap_daemon_work(struct bitmap *b
clear_page_attr(bitmap, page, BITMAP_PAGE_NEEDWRITE);

spin_unlock_irqrestore(&bitmap->lock, flags);
- if (need_write)
+ if (need_write) {
write_page(bitmap, page, 0);
+ bitmap->allclean = 0;
+ }
continue;
}

@@ -1098,6 +1105,9 @@ void bitmap_daemon_work(struct bitmap *b
/*
if (j < 100) printk("bitmap: j=%lu, *bmc = 0x%x\n", j, *bmc);
*/
+ if (*bmc)
+ bitmap->allclean = 0;
+
if (*bmc == 2) {
*bmc=1; /* maybe clear the bit next time */
set_page_attr(bitmap, page, BITMAP_PAGE_CLEAN);
@@ -1132,6 +1142,8 @@ void bitmap_daemon_work(struct bitmap *b
}
}

+ if (bitmap->allclean == 0)
+ bitmap->mddev->thread->timeout = bitmap->daemon_sleep * HZ;
}

static bitmap_counter_t *bitmap_get_counter(struct bitmap *bitmap,
@@ -1226,6 +1238,7 @@ int bitmap_startwrite(struct bitmap *bit
sectors -= blocks;
else sectors = 0;
}
+ bitmap->allclean = 0;
return 0;
}

@@ -1296,6 +1309,7 @@ int bitmap_start_sync(struct bitmap *bit
}
}
spin_unlock_irq(&bitmap->lock);
+ bitmap->allclean = 0;
return rv;
}

@@ -1332,6 +1346,7 @@ void bitmap_end_sync(struct bitmap *bitm
}
unlock:
spin_unlock_irqrestore(&bitmap->lock, flags);
+ bitmap->allclean = 0;
}

void bitmap_close_sync(struct bitmap *bitmap)
@@ -1399,7 +1414,7 @@ static void bitmap_set_memory_bits(struc
set_page_attr(bitmap, page, BITMAP_PAGE_CLEAN);
}
spin_unlock_irq(&bitmap->lock);
-
+ bitmap->allclean = 0;
}

/* dirty the memory and file bits for bitmap chunks "s" to "e" */

diff .prev/include/linux/raid/bitmap.h ./include/linux/raid/bitmap.h
--- .prev/include/linux/raid/bitmap.h 2008-02-22 15:45:56.000000000 +1100
+++ ./include/linux/raid/bitmap.h 2008-02-22 15:45:56.000000000 +1100
@@ -235,6 +235,8 @@ struct bitmap {

unsigned long flags;

+ int allclean;
+
unsigned long max_write_behind; /* write-behind mode */
atomic_t behind_writes;

NeilBrown

Mar 2, 2008, 7:30:18 PM

A resync/reshape/recovery thread will refuse to progress when the
array is marked read-only. So whenever we mark it not read-only, it
is important to wake up the resync thread.
There is one place we didn't do this.

The problem manifests if the start_ro module parameter is set, and a
raid5 array that is in the middle of a reshape (restripe) is started.
The array will initially be semi-read-only (meaning it acts like it is
readonly until the first write). So the reshape will not proceed.

On the first write, the array will become read-write, but the reshape
will not be started, and there is no event which will ever restart
that thread.
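
The transition can be modelled with a few counters. This is a hypothetical userspace sketch (`toy_md`, `toy_thread`, `toy_write_start`), with the value 2 standing for the semi-read-only state as an assumption of the sketch; it only shows that both threads get a wake-up when the first write flips the array to read-write.

```c
#include <assert.h>
#include <stddef.h>

struct toy_thread { int wakeups; };

struct toy_md {
    int ro;                          /* 0 = read-write, 2 = semi-read-only */
    struct toy_thread thread;        /* management thread */
    struct toy_thread *sync_thread;  /* resync/reshape thread, may be NULL */
};

/* Waking a thread that does not exist is a no-op. */
void toy_wakeup(struct toy_thread *t)
{
    if (t)
        t->wakeups++;
}

/* First write to a semi-read-only array: switch to read-write and
 * wake both threads so a stalled reshape can continue. */
void toy_write_start(struct toy_md *m)
{
    assert(m != NULL);
    if (m->ro == 2) {
        m->ro = 0;
        toy_wakeup(&m->thread);
        toy_wakeup(m->sync_thread);
    }
}
```

Later writes find the array already read-write and wake nothing, so the extra wake-up costs one call on the first write only.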

Signed-off-by: Neil Brown <ne...@suse.de>

### Diffstat output
./drivers/md/md.c | 1 +
1 file changed, 1 insertion(+)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c 2008-02-22 15:46:25.000000000 +1100
+++ ./drivers/md/md.c 2008-02-22 15:46:52.000000000 +1100
@@ -5356,6 +5356,7 @@ void md_write_start(mddev_t *mddev, stru
mddev->ro = 0;
set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
md_wakeup_thread(mddev->thread);
+ md_wakeup_thread(mddev->sync_thread);
}
atomic_inc(&mddev->writes_pending);
if (mddev->in_sync) {