LVM over ZFS?


Sean Reifschneider

May 12, 2021, 3:59:17 AM5/12/21
to gan...@googlegroups.com
Long-time ganeti user here.  Some recent hardware weirdness has left me seriously planning on moving from LVM storage to ZFS, so I have a machine set up and am testing it.  I've tried both the candlerb and ffzg ext providers; I'm leaning towards candlerb just because it's simpler and doesn't touch the system LVM binaries.

Tonight I started thinking: what about just putting LVM on top of a ZVOL PV?  It's another layer of indirection, sure, but it gives us the ZFS scrub/checksum/error detection and correction without having to use the ext provider.  I guess I'm uneasy because ffzg seems to say that there are some one-way operations, and I haven't yet set up a second node to be able to test "gnt-instance move".
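For concreteness, a minimal sketch of that layering (the pool name "tank", the sizes, and the volume group name are all placeholders, not anything Ganeti mandates; run on a test box first):

```shell
# Create a zvol to act as the LVM physical volume.  -s makes it sparse;
# volblocksize is worth tuning against the LVM extent size and workload.
zfs create -s -V 500G -o volblocksize=64k tank/lvm-pv

# Layer LVM on top; Ganeti then just sees an ordinary volume group.
pvcreate /dev/zvol/tank/lvm-pv
vgcreate xenvg /dev/zvol/tank/lvm-pv
```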

Aside: Now that I'm thinking about that though, it sure would be nice to be able to: snapshot, zfs send/recv, THEN shut down the node and only send the changed data.  I have some 3 hour moves that would be nice to be smaller.  I guess the same thing can be done with LVM and snapshot volumes...
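That send/recv flow would look roughly like this (dataset and host names are made up, and this is entirely outside what gnt-instance manages):

```shell
# 1. While the instance is still running: snapshot and do the bulk copy.
zfs snapshot tank/vm-disk0@presync
zfs send tank/vm-disk0@presync | ssh node2 zfs recv -F tank/vm-disk0

# 2. Shut the instance down, snapshot again, and send only the blocks
#    that changed since the first snapshot (-i = incremental).
zfs snapshot tank/vm-disk0@final
zfs send -i @presync tank/vm-disk0@final | ssh node2 zfs recv tank/vm-disk0
```

The downtime window then covers only the incremental send, not the full disk.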

Our reason for looking at ZFS is related to 2 machines that had unexplained data corruption on both the host OS and guests residing on 2 of 6 nodes.  At first I thought the issue might be related to a recent set of firmware updates that were applied to one node (backplane, BIOS, HBA, drives, around a year's worth of firmware updates).  But the other node that started behaving similarly had been up for 9 months without firmware updates.  I do rolling firmware updates about once a year.

The vendor ended up replacing the entire "RAID chain" on the first system (controller, cables, 2 of 6 drives, backplane).  It has me asking "how did we even get here?": we had silent corruption despite having regular RAID array verifications running.

Thanks,
Sean

Brian Candler

May 27, 2021, 5:17:37 PM5/27/21
to ganeti
Taking a ZFS zvol and slicing it up with LVM should in principle work, and would let you run DRBD on top; but the downside is that at the ZFS layer, you could only snapshot and replicate *all* your VMs at once.

I'm also not sure, when you delete an LVM volume, whether this would propagate through using TRIM to the zvol and actually free up the underlying space.
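One way to check that empirically (zvol and LV names here are hypothetical): LVM only issues discards on lvremove/lvreduce when `issue_discards` is enabled in lvm.conf, so compare the zvol's space accounting before and after:

```shell
# issue_discards lives in the devices{} section of lvm.conf and is off
# by default; without it, lvremove never sends TRIM down the stack.
grep issue_discards /etc/lvm/lvm.conf

# Watch the zvol's "used" property around removing an LV.
zfs get -H -o value used tank/lvm-pv
lvremove -y xenvg/testlv
zfs get -H -o value used tank/lvm-pv   # should drop if discards propagate
```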

Silent data corruption with RAID is sadly a fact of life.  During a RAID scrub, a RAID5 has enough information to tell that parity is wrong - but has insufficient data to know how to fix it - and therefore, RAID controllers don't check for this condition, because it would make them look bad to report that data was corrupted irretrievably.

In theory, RAID6 would allow recovery from incorrect data on any single disk, but I don't know of any RAID controller that actually does that.

RAIDZn has all the checksums, so it knows which bits are good and which are bad.
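(For instance, a ZFS scrub re-reads every allocated block, verifies it against the checksum stored in its parent, and rewrites any bad copy from redundancy; the pool name below is hypothetical.)

```shell
# Walk every allocated block and verify it against its parent checksum;
# blocks that fail verification are rebuilt from mirror/RAIDZ redundancy.
zpool scrub tank

# The "scan:" line reports bytes repaired; "errors:" lists anything
# that could not be reconstructed.
zpool status -v tank
```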

Martin McClure

May 27, 2021, 8:00:55 PM5/27/21
to gan...@googlegroups.com, Brian Candler
On 5/27/21 2:17 PM, Brian Candler wrote:
> Silent data corruption with RAID is sadly a fact of life.  During a
> RAID scrub, a RAID5 has enough information to tell that parity is
> wrong - but has insufficient data to know how to fix it - and
> therefore, RAID controllers don't check for this condition, because it
> would make them look bad to report that data was corrupted irretrievably.
>
> In theory, RAID6 would allow recovery from incorrect data on any
> single disk, but I don't know of any RAID controller that actually
> does that.
>
Interesting. Do you know if that limitation on RAID6 recovery applies to
Linux kernel soft RAID? Our Ganeti nodes that host VMs with critical
data are built on LVM over RAID6 over five drives. I've had bad
experiences with hardware RAID cards, so this is Linux md RAID6, and
anyway for these particular applications, data integrity is more
important than speed.

Thanks,
-Martin

Brian Candler

May 29, 2021, 5:01:59 AM5/29/21
to ganeti
On Friday, 28 May 2021 at 01:00:55 UTC+1 Martin McClure wrote:
> Do you know if that limitation on RAID6 recovery applies to
> Linux kernel soft RAID?

Answered here: https://unix.stackexchange.com/questions/137384/raid6-scrubbing-mismatch-repair - although this is 7 years old.

According to that, it assumes the parity is bad and rewrites the parity, rather than using the parity to correct the bad data.  But you'd probably want to do a test to check if that's still the case.
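If you do want to test it, md exposes the scrub machinery through sysfs (array name assumed to be md0; do this on a scratch array, since "repair" rewrites blocks):

```shell
# Kick off a read-only consistency check of the whole array.
echo check > /sys/block/md0/md/sync_action

# After it finishes, a nonzero count means inconsistent stripes were found.
cat /sys/block/md0/md/mismatch_cnt

# "repair" resolves the mismatches; for raid6 the open question is whether
# it recomputes data from parity or just rewrites parity from the data.
echo repair > /sys/block/md0/md/sync_action
```

Injecting known garbage into one member device underneath the array, then checking which copy "repair" keeps, would answer the question directly.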

I trust Linux md RAID6 at least as much as proprietary RAID controller cards.

However, I would not use any flavour of RAID6 for VM instance storage: it just performs way too poorly for small block writes (writing a single block involves reading the original block and parity, and writing back the original block and parity).  RAID10 all the way.

Use RAID6 for long-term archive of large data files which are written once and don't change.

Sascha Lucas

Oct 1, 2021, 9:59:19 AM10/1/21
to gan...@googlegroups.com
Hi,

On Sat, 29 May 2021 11:01:59 +0200 Brian Candler wrote:

> On Friday, 28 May 2021 at 01:00:55 UTC+1 Martin McClure wrote:
> Do you know if that limitation on RAID6 recovery applies to
> Linux kernel soft RAID?
>
>
>
> Answered here: https://unix.stackexchange.com/questions/137384/raid6-scrubbing-mismatch-repair
> -  although this is 7 years old.
>
>
> According to that, it assumes the parity is bad and rewrites the parity,
> rather than using the parity to correct the bad data.  But you'd probably
> want to do a test to check if that's still that case.

Since this discussion has occurred, this topic "RAID6 + silent data
corruption" regularly comes to my mind. I just want to throw in some
theoretical thoughts:

RAID6 can only recover from silent corruption of a *parity* block, by a majority vote between the stored parity copy and the parity recalculated from the data stripes. In this case it would be fine to just update the wrong parity copy.

Other than that, RAID6 cannot recover from silent corruption of a data stripe, because it can't tell which data stripe is wrong (this is different from a failed disk, where you know which stripe is missing). To recover from silent corruption you would need to store a checksum alongside each data stripe. Updating the parity anyway sounds bad, but seems to be the only automatic option?

What do you think? Are these thoughts right? If so, RAID6 doesn't have much advantage over RAID5 WRT silent corruption.

Thanks, Sascha.

Brian Candler

Oct 2, 2021, 12:02:56 PM10/2/21
to ganeti
On Friday, 1 October 2021 at 14:59:19 UTC+1 sascha wrote:
> RAID6 can only recover from silent parity corruption, by majority vote of
> one parity copy and the calculated parity from data stripes. In this case it
> would be fine to just update the wrong parity copy.


Or more importantly, it could also correct a wrong *data* block, because with two parities any single corrupt block (data or parity) can be identified.

It's a theoretical feature of RAID6 that it could do this: given a set of N data blocks + 2 parity blocks, if any one of those blocks is corrupt, it could detect which one and recover it.  I don't know of any practical implementations which do, and therefore I'd agree that in practice, "RAID6 has not much advantage over RAID5 WRT silent corruption".

It would be interesting to see if liberasure handles this.  Again in practice, I'd expect clients to read only the data blocks, and fetch the parity blocks if and only if data blocks are missing.  However,  a periodic "deep scrub" could in principle check all the blocks for consistency and repair if necessary.

ZFS detects and corrects errors on every read, because it knows for sure whether each block has been read back correctly or not (due to the parent checksum), and therefore knows what needs regenerating from parity.  It never returns bad data to the application.