[Lustre-discuss] md and mdadm in 1.8.7-wc1


Samuel Aparicio

Mar 20, 2012, 1:05:42 AM3/20/12
to Lustre Discussion list
I am wondering if anyone has experienced issues with md / mdadm in the 1.8.7-wc1 patched server kernels?
We have historically used software RAID on our OSS machines because, in our hands, it provided 20-30% more throughput than
the RAID provided by our storage arrays (Coraid ATA-over-Ethernet shelves). With 1.8.5 this worked more or less flawlessly,
but we now have new storage, with 3TB rather than 2TB disks, and new servers with 1.8.7-wc1 patched kernels.

md is unable to reliably shut down and restart arrays after the machines have been rebooted (cleanly): the disks are no
longer recognized as part of the arrays they were created in. In the kernel log we have seen the messages below, which
include the following:

md: bug in file drivers/md/md.c, line 1677

Looking through the mdadm changelogs, it seems there are some possible patches for md in 2.6.18 kernels, but I cannot
tell whether they have been applied here, or whether this is even relevant.

I am not clear whether this is an issue with the 3TB disks, or something else related to mdadm and the patched server kernel.
My suspicion is that something has broken with >2.2TB disks.
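
For what it's worth, the sanity check I plan to run on the OSS to confirm the kernel actually sees the full 3TB on each
AoE device is roughly the following (just a sketch, using one of the member devices from the log below):

    # capacity of one member device as the kernel sees it, in bytes;
    # a 3TB disk should report roughly 3,000,000,000,000
    blockdev --getsize64 /dev/etherd/e14.0

    # sizes of all the AoE block devices, in 1K blocks
    grep etherd /proc/partitions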

Does anyone have any ideas about this?

thanks
sam aparicio

---------------
Mar 19 21:34:48 OST3 kernel: md: **********************************
Mar 19 21:34:48 OST3 kernel:
Mar 19 21:35:20 OST3 kernel: md: bug in file drivers/md/md.c, line 1677
Mar 19 21:35:20 OST3 kernel:
Mar 19 21:35:20 OST3 kernel: md: **********************************
Mar 19 21:35:20 OST3 kernel: md: * <COMPLETE RAID STATE PRINTOUT> *
Mar 19 21:35:20 OST3 kernel: md: **********************************
Mar 19 21:35:20 OST3 kernel: md142:
Mar 19 21:35:20 OST3 kernel: md141:
Mar 19 21:35:20 OST3 kernel: md140: <etherd/e14.16><etherd/e14.15><etherd/e14.14><etherd/e14.13><etherd/e14.12><etherd/e14.11><etherd/e14.10><etherd/e14.9><etherd/e14.8><etherd/e14.7><etherd/e14.6><etherd
/e14.5><etherd/e14.4><etherd/e14.3><etherd/e14.2><etherd/e14.1><etherd/e14.0>
Mar 19 21:35:20 OST3 kernel: md: rdev etherd/e14.16, SZ:2930265344 F:0 S:0 DN:16
Mar 19 21:35:20 OST3 kernel: md: rdev superblock:
Mar 19 21:35:20 OST3 kernel: md: SB: (V:1.0.0) ID:<9859f274.34313a61.00000030.00000000> CT:5d3314af
Mar 19 21:35:20 OST3 kernel: md: L234772919 S861164367 ND:1970037550 RD:1919251571 md1667457582 LO:65536 CS:196610
Mar 19 21:35:20 OST3 kernel: md: UT:00000800 ST:0 AD:1565563648 WD:1 FD:8 SD:0 CSUM:00000000 E:00000000
Mar 19 21:35:20 OST3 kernel: D 0: DISK<N:-1,(-1,-1),R:-1,S:-1>
Mar 19 21:35:20 OST3 kernel: D 1: DISK<N:-1,(-1,-1),R:-1,S:-1>
Mar 19 21:35:20 OST3 kernel: D 2: DISK<N:-1,(-1,-1),R:-1,S:-1>
Mar 19 21:35:20 OST3 kernel: D 3: DISK<N:-1,(-1,-1),R:-1,S:-1>
Mar 19 21:35:20 OST3 kernel: md: THIS: DISK<N:0,(0,0),R:0,S:0>
< output truncated >


Professor Samuel Aparicio BM BCh PhD FRCPath
Nan and Lorraine Robertson Chair UBC/BC Cancer Agency
675 West 10th, Vancouver V5Z 1L3, Canada.
office: +1 604 675 8200 lab website http://molonc.bccrc.ca

_______________________________________________
Lustre-discuss mailing list
Lustre-...@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss

Robin Humble

Mar 20, 2012, 2:46:12 AM3/20/12
to Samuel Aparicio, Lustre Discussion list
On Mon, Mar 19, 2012 at 10:05:42PM -0700, Samuel Aparicio wrote:
>I am wondering if anyone has experienced issues with md / mdadm in the 1.8.7-wc1 patched server kernels?

I've seen an issue:
http://jira.whamcloud.com/browse/LU-1115
although it looks quite different to your problem... still, you might
hit that problem next.

>We have historically used software RAID on our OSS machines because, in our hands, it provided 20-30% more throughput than
>the RAID provided by our storage arrays (Coraid ATA-over-Ethernet shelves). With 1.8.5 this worked more or less flawlessly,
>but we now have new storage, with 3TB rather than 2TB disks, and new servers with 1.8.7-wc1 patched kernels.

I don't have any 3TB disks to test with, but I think you need to use a
newer superblock format for 3TB devices,
e.g. use
  mdadm -e 1.2 ...
see 'man mdadm', which says something about a 2TB maximum per device for the 0.90 format.
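
a rough sketch of what that looks like when (re)creating an array - the level, chunk and device list here are only
illustrative, and note that --create writes new superblocks, so only do it on an array you're rebuilding anyway:

    # create the array with a v1.2 superblock rather than the old 0.90 format
    # (0.90 cannot describe component devices larger than 2TB)
    mdadm --create /dev/md140 --metadata=1.2 --level=10 --chunk=256 \
          --raid-devices=16 --spare-devices=1 \
          /dev/etherd/e14.[0-9] /dev/etherd/e14.1[0-6]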

also I'm not quite sure how to read the printout below, but it looks like
you have 17 3TB disks in a single RAID? that's a lot... I thought
ldiskfs was only OK up to 24TB these days?

>md is unable to reliably shut down and restart arrays after the machines have been rebooted (cleanly): the disks are no
>longer recognized as part of the arrays they were created in. In the kernel log we have seen the messages below, which
>include the following:
>
> md: bug in file drivers/md/md.c, line 1677

    if (!mddev->events) {
        /*
         * oops, this 64-bit counter should never wrap.
         * Either we are in around ~1 trillion A.C., assuming
         * 1 reboot per second, or we have a bug:
         */
        MD_BUG();
        mddev->events --;
    }


so it looks like your md superblock is corrupted. that's consistent with
needing a newer superblock version.
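
you can see what md actually finds on disk with something like this (illustrative only):

    # dump the raid superblock stored on one member device
    mdadm --examine /dev/etherd/e14.0

    # scan all members and print the arrays they claim to belong to
    mdadm --examine --scan --verbose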

other less likely possibilities:
- could it also be that your Coraid devices have problems with >2TB?
- if you are running 32-bit kernels, something could be wrong there.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility

Samuel Aparicio

Mar 20, 2012, 12:18:24 PM3/20/12
to Robin Humble, Lustre Discussion list
Hello, thanks for this - it's a 16-disk RAID10 (with one spare), so 24TB. I previously tried 1.0 and 1.2 metadata, to no effect.
We are using a 256KB chunk size; I haven't tried reverting to 64KB, but will do so.

This looks to me like something different - there was an md patch for 2.6.18 kernels relating to the updating of superblocks
on arrays, which this might be ...

The storage vendor has tried md with 3TB disks and mdadm version 3.2.3 and sees no problems, but this is also with a much
later kernel version. I am not sure 3.2.3 would work with kernels as old as 2.6.18-274.
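
mdadm itself is userspace, so if we do test 3.2.3 against this kernel the plan would be roughly the following (a sketch
only, run from wherever the 3.2.3 source tree is unpacked):

    # build a newer mdadm in its own tree, without installing over the distro copy
    cd mdadm-3.2.3 && make

    # run the freshly built binary directly against one member device
    ./mdadm --version
    ./mdadm --examine /dev/etherd/e14.0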

Joe Landman

Mar 20, 2012, 12:25:22 PM3/20/12
to lustre-...@lists.lustre.org
On 03/20/2012 12:18 PM, Samuel Aparicio wrote:
> Hello, thanks for this - it's a 16-disk RAID10 (with one spare), so 24TB.
> I previously tried 1.0 and 1.2 metadata, to no effect.
> We are using a 256KB chunk size; I haven't tried reverting to 64KB, but
> will do so.
>
> This looks to me like something different - there was an md patch for
> 2.6.18 kernels relating to the updating of superblocks on arrays,
> which this might be ...
>
> The storage vendor has tried md with 3TB disks and mdadm version 3.2.3
> and sees no problems, but this is also with a much later kernel version.
> I am not sure 3.2.3 would work with kernels as old as 2.6.18-274.

It will, but the issue is more than likely the age of the kernel
combined with driver issues. We tried AoE in the dim and distant
past and found all manner of corruption problems we couldn't work around.

If you can update to a newer kernel, this might be a better course.
1.8.7+ builds against some of the more modern kernels. I don't think
we've posted a build against our 2.6.32.58.scalable kernel, but I'll
look into it.
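
For reference, the rough shape of a patched-server build is the following (a sketch only - the kernel tree path and the
series file name are placeholders; pick the series under lustre/kernel_patches/series/ that matches the target kernel):

    # link the lustre kernel patch series into the kernel source and apply it with quilt
    cd /usr/src/linux-2.6.32.x                  # placeholder kernel tree
    ln -s /path/to/lustre/lustre/kernel_patches/series/<target>.series series
    ln -s /path/to/lustre/lustre/kernel_patches/patches patches
    quilt push -av

    # after building and booting the patched kernel, build lustre against that tree
    cd /path/to/lustre
    ./configure --with-linux=/usr/src/linux-2.6.32.x
    make rpms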


--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: lan...@scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615

Samuel Aparicio

Mar 20, 2012, 2:42:51 PM3/20/12
to lan...@scalableinformatics.com, lustre-...@lists.lustre.org
Thanks for this.

We have zero experience of patching Lustre 1.8.7 against newer kernels - are there any known pitfalls we should avoid?


Professor Samuel Aparicio BM BCh PhD FRCPath
Nan and Lorraine Robertson Chair UBC/BC Cancer Agency
675 West 10th, Vancouver V5Z 1L3, Canada.
office: +1 604 675 8200 lab website http://molonc.bccrc.ca

Joe Landman

Mar 20, 2012, 5:53:11 PM3/20/12
to Samuel Aparicio, lustre-...@lists.lustre.org
On 03/20/2012 02:42 PM, Samuel Aparicio wrote:
> Thanks for this.
>
> We have zero experience of patching Lustre 1.8.7 against newer kernels -
> are there any known pitfalls we should avoid?

I just completed a (completely untested at the moment) build of
2.6.32.59.scalable with 1.8.7 patches. We'll do some testing here
first, but if it looks like it works well, we'll put up a pointer to the
RPM. Or, if you want to risk your dog biting you, your servers possibly
exploding and strafing your data center with untold numbers of bits,
and a diskful or two of lost metadata, I'd be happy to make it
available before our testing begins (we don't care if we lose data on
our test rigs).

Regards,

Joe

Samuel Aparicio

Mar 21, 2012, 12:13:26 AM3/21/12
to lan...@scalableinformatics.com, lustre-...@lists.lustre.org
I appreciate your taking the time to do that. If it looks like it basically works, we'll give it a try and see if the md issues go away.
We have a pair of OSS servers with a 200TB filesystem that we can try this on without risking anything important,
so I think what we may do is compare the 1.8.7 build against a newer kernel with the 2.1.1 release,
which looks like it is approaching stability for production environments, and see how each fares with mdadm.

Professor Samuel Aparicio BM BCh PhD FRCPath
Nan and Lorraine Robertson Chair UBC/BC Cancer Agency
675 West 10th, Vancouver V5Z 1L3, Canada.
office: +1 604 675 8200 lab website http://molonc.bccrc.ca