
Is it factible to implement full stripe writes to RAID using NVRAM memory in LFS?


Jose Luis Rodriguez Garcia

Aug 18, 2016, 12:39:59 PM
to tech...@netbsd.org
I would like to implement this in LFS:

Write full stripes to RAID 5/6 from LFS using a NVRAM card or similar:

For this, the segment would first be written to an NVRAM card or similar.
When the full segment has been written to the NVRAM card, it would be
written to the RAID as a full stripe, without paying the RAID
partial-stripe write penalty. To keep the implementation easy, I have
thought of increasing/decreasing the LFS segment size so that it is a
multiple of the stripe size of the RAID device.
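
For example (illustrative numbers only, not from an actual
configuration): with a RAIDframe RAID 5 set of 4 data disks and a
16 KB stripe unit,

    stripe size         = 4 data disks * 16 KB = 64 KB
    default LFS segment = 1 MB = 16 full stripes

With 5 data disks the stripe would be 80 KB, which does not divide 1 MB
evenly, so the segment size would have to be adjusted (e.g. to 960 KB =
12 stripes) to keep segment writes aligned to full stripes.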

Question: Is it a small/medium/big project?



At the moment I am reading code, fixing bugs that I find, and helping
to stabilize the code.

I suppose that it will take me a year or more to fully understand LFS
(my pace is slow because of work and family).

Jose Luis Rodriguez Garcia

Aug 18, 2016, 1:20:28 PM
to tech...@netbsd.org
s/factible/feasible/

Is it feasible to implement....

Eduardo Horvath

Aug 18, 2016, 1:25:02 PM
to Jose Luis Rodriguez Garcia, tech...@netbsd.org
On Thu, 18 Aug 2016, Jose Luis Rodriguez Garcia wrote:

> I would like to implement this in LFS:
>
> Write full stripes to RAID 5/6 from LFS using a NVRAM card or similar:
>
> For this, the segment would first be written to an NVRAM card or similar.
> When the full segment has been written to the NVRAM card, it would be
> written to the RAID as a full stripe, without paying the RAID
> partial-stripe write penalty. To keep the implementation easy, I have
> thought of increasing/decreasing the LFS segment size so that it is a
> multiple of the stripe size of the RAID device.
>
> Question: Is it a small/medium/big project?

Ummm... It's probably a big project. And I'm not sure it buys you much if
anything.

A regular unix filesystem will use synchronous metadata writes to keep the
FS image consistent if the box crashes or loses power. NVRAM will speed
up those operations.

LFS writes the metadata at the same time, in the same place as the data.
No synchronous writes necessary. In theory, if there is a failure you
just roll back the log to an earlier synchronization point. You lose the
data after that point, but that should be fairly small, and you would have
lost it anyway with a regular FS. And you should be able to roll back the
filesystem to snapshots of any earlier synchronization points.

The problem is that LFS is less a product than a research project:

o Although there are multiple super blocks scattered across the disk just
like FFS, LFS only uses the first and last one. If both of those are
corrupt, the filesystem image cannot be recovered. LFS should be enhanced
to cycle through all the different super blocks for enhanced robustness.

o The rollback code is quite sketchy. It doesn't really work so well, so
LFS has problems recovering from failures.

o LFS keeps all of its inodes in a file called the ifile. It's a regular
LFS file, so in theory you can scan back to recover earlier revisions of
that file. Also, fsck_lfs should be able to reconstruct the ifile from
scratch by scanning the disk. This is yet another feature that has not
been implemented.

LFS writes data in what's called a subsegment. This is essentially an
atomic operation which contains data and metadata. The subsegments are
collected into segments, which contain more metadata, such as a current
snapshot of the ifile. All the disk sectors in a subsegment are
checksummed, so partial writes can be detected. If the checksum on a
subsegment is incorrect, LFS should roll back to a previous subsegment
that does have a correct checksum. I don't think that code exists, or if
it does I don't think it works.
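
To make the idea concrete, here is a minimal userland sketch (this is
not the actual sys/ufs/lfs code; the structure and names below are
invented for illustration): each write unit carries a checksum over its
sectors, so recovery can tell a torn write from a complete one and stop
at the last unit that still verifies.

    /* Illustrative only -- not the real LFS on-disk format. */
    #include <stdint.h>
    #include <stddef.h>

    struct wsum {                    /* invented summary structure */
            uint32_t datasum;        /* checksum over the data sectors */
            uint32_t nsectors;       /* number of sectors covered */
    };

    static uint32_t
    sector_cksum(const uint8_t *buf, uint32_t nsectors, size_t secsize)
    {
            uint32_t sum = 0;

            /* Fold every byte of every sector into a simple checksum. */
            for (uint32_t i = 0; i < nsectors; i++)
                    for (size_t j = 0; j < secsize; j++)
                            sum += buf[(size_t)i * secsize + j];
            return sum;
    }

    /* Returns 1 if the unit verifies, 0 if the write was torn. */
    static int
    wsum_verify(const struct wsum *ws, const uint8_t *data, size_t secsize)
    {
            return sector_cksum(data, ws->nsectors, secsize) == ws->datasum;
    }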

Anyway, hacking on LFS is lots of fun. Enjoy!

Eduardo

Jose Luis Rodriguez Garcia

Aug 18, 2016, 1:59:06 PM
to Eduardo Horvath, tech...@netbsd.org
On Thu, Aug 18, 2016 at 7:24 PM, Eduardo Horvath <e...@netbsd.org> wrote:
>
> LFS writes the metadata at the same time, in the same place as the data.
> No synchronous writes necessary.

As I understand it, LFS needs to do synchronous writes when metadata
operations (directories) or fsync operations are involved. Instead of
writing a full segment (1 MB by default), it writes a "small segment".
That kills performance on RAID 5/6, because the write isn't a full
stripe: you have to read all the disks to calculate the new parity, so
1 write on a RAID of x disks = x reads + 2 writes (data + parity).

The NVRAM memory solves this problem by acting as a buffer / write cache.

>
> The problem is that LFS is less a product than a research project:
Yes, I know this. But it seems that it was nearly stable in previous
versions of NetBSD: 1.6 and 4.0. David Holland is slowly solving some
of these problems (mostly for the MP kernel). I also want to give my
small help.

>
> Anyway, hacking on LFS is lots of fun. Enjoy!
This is true. I want to contribute to a great project as NetBSD.

Eduardo Horvath

Aug 18, 2016, 2:23:43 PM
to Jose Luis Rodriguez Garcia, tech...@netbsd.org
On Thu, 18 Aug 2016, Jose Luis Rodriguez Garcia wrote:

> On Thu, Aug 18, 2016 at 7:24 PM, Eduardo Horvath <e...@netbsd.org> wrote:
> >
> > LFS writes the metadata at the same time, in the same place as the data.
> > No synchronous writes necessary.
>
> As I understand it, LFS needs to do synchronous writes when metadata
> operations (directories) or fsync operations are involved. Instead of
> writing a full segment (1 MB by default), it writes a "small segment".
> That kills performance on RAID 5/6, because the write isn't a full
> stripe: you have to read all the disks to calculate the new parity, so
> 1 write on a RAID of x disks = x reads + 2 writes (data + parity).
>
> The NVRAM memory solves this problem by acting as a buffer / write cache.

If the rollback code worked properly then sync writes should not be
necessary. Looks like the SEGM_SYNC flag is only set when LFS is writing
a checkpoint. But I'm not sure there's any guarantee that earlier
segments have hit the disk.

Anyway, I still think fixing LFS so synchronous writes are not needed is a
better use of time than making it use a hardware workaround.

I suppose adding code to LFS where it posts the sync write to copy
it out to NVRAM would be relatively easy. But then you still need to hack
the recovery code to look for data in the NVRAM *and* figure out how to
use it to repair the filesystem. (Which it should be able to repair just
fine without the data in the NVRAM BTW.)

Better to fix the recovery code and just turn off sync writes entirely. I
suppose other people may have different opinions on the subject.

Eduardo

David Holland

Aug 18, 2016, 2:36:12 PM
to Eduardo Horvath, Jose Luis Rodriguez Garcia, tech...@netbsd.org
some quibbles:

On Thu, Aug 18, 2016 at 05:24:53PM +0000, Eduardo Horvath wrote:
> And you should be able to roll back the
> filesystem to snapshots of any earlier synchronization points.

In LFS there are only two snapshots and in practice often one of
them's not valid (because it was halfway through being taken when the
machine went down) so rolling further back isn't that feasible.

> The problem is that LFS is less a product than a research project:
>
> o Although there are multiple super blocks scattered across the disk just
> like FFS, LFS only uses the first and last one. If both of those are
> corrupt, the filesystem image cannot be recovered. LFS should be enhanced
> to cycle through all the different super blocks for enhanced robustness.

This should be left to fsck, like it is in ffs. I don't remember if
fsck_lfs supports recovering from an alternate superblock, but it
isn't going to be that hard.

> o The rollback code is quite sketchy. It doesn't really work so well, so
> LFS has problems recovering from failures.

Rolling *back* to the last snapshot is easy. It's the roll-forward
code that's dodgy, isn't it?

> o LFS keeps all of its inodes in a file called the ifile. It's a regular
> LFS file, so in theory you can scan back to recover earlier revisions of
> that file. Also, fsck_lfs should be able to reconstruct the ifile from
> scratch by scanning the disk. This is yet another feature that has not
> been implemented.

That's not how the ifile works. It's a file of inode locations, not
inodes. However, that means it *can* be reconstructed. I'm not sure to
what extent fsck_lfs can do this.

> LFS writes data in what's called a subsegment. This is essentially an
> atomic operation which contains data and metadata. The subsegments are
> collected into segments, which contain more metadata, such as a current
> snapshot of the ifile. All the disk sectors in a subsegment are
> checksummed, so partial writes can be detected. If the checksum on a
> subsegment is incorrect, LFS should roll back to a previous subsegment
> that does have a correct checksum. I don't think that code exists, or if
> it does I don't think it works.

That's not how it works.

--
David A. Holland
dhol...@netbsd.org

David Holland

Aug 18, 2016, 2:41:24 PM
to Jose Luis Rodriguez Garcia, Eduardo Horvath, tech...@netbsd.org
On Thu, Aug 18, 2016 at 07:58:53PM +0200, Jose Luis Rodriguez Garcia wrote:
> > LFS writes the metadata at the same time, in the same place as the data.
> > No synchronous writes necessary.
>
> As I understand it, LFS needs to do synchronous writes when metadata
> operations (directories) or fsync operations are involved. Instead of
> writing a full segment (1 MB by default), it writes a "small segment".
> That kills performance on RAID 5/6, because the write isn't a full
> stripe: you have to read all the disks to calculate the new parity, so
> 1 write on a RAID of x disks = x reads + 2 writes (data + parity).
>
> The NVRAM memory solves this problem by acting as a buffer / write cache.

Short segments occur because the ratio of syncs to new blocks written
is too high in practice: you have to write out before there's enough
data to fill a segment. Rearranging it to assemble whole segments in
nvram before writing them to disk is possible but would be a fairly
big project.
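
To put rough numbers on it (purely illustrative, not measurements):

    segment size   = 1 MB
    data per fsync = 64 KB
    => every sync flushes a 64 KB partial segment, 1/16 of a segment,
       and each of those reaches a RAID 5/6 volume as a partial-stripe
       write.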

One could also integrate this with cleaning so you ~always write a
whole segment by filling it up with data from cleaning if you don't
have anything else to write, but that's a *big* project.

Right now my chief concern is making it work reliably, since it
currently seems not to. The first order of business seems to be to
come up with a new locking model, since the existing locking is not
just bodgy (like we knew) but also not self-consistent and in places
glaringly incorrect.

Eduardo Horvath

Aug 18, 2016, 3:00:11 PM
to David Holland, Jose Luis Rodriguez Garcia, tech...@netbsd.org
On Thu, 18 Aug 2016, David Holland wrote:

> some quibbles:
>
> On Thu, Aug 18, 2016 at 05:24:53PM +0000, Eduardo Horvath wrote:
> > And you should be able to roll back the
> > filesystem to snapshots of any earlier synchronization points.
>
> In LFS there are only two snapshots and in practice often one of
> them's not valid (because it was halfway through being taken when the
> machine went down) so rolling further back isn't that feasible.

I don't remember seeing any code that overwrites old snapshots, so most of
them are still on disk. It's just a question of finding them, which is
where the first and last superblock come into play.

> > The problem is that LFS is less a product than a research project:
> >
> > o Although there are multiple super blocks scattered across the disk just
> > like FFS, LFS only uses the first and last one. If both of those are
> > corrupt, the filesystem image cannot be recovered. LFS should be enhanced
> > to cycle through all the different super blocks for enhanced robustness.
>
> This should be left to fsck, like it is in ffs. I don't remember if
> fsck_lfs supports recovering from an alternate superblock, but it
> isn't going to be that hard.

The LFS superblock contains a pointer to the end of the log. Since LFS
only ever updates that pointer on the first and last superblock, if you try
to use any other superblock to recover the filesystem you essentially roll
it back to just after newfs_lfs ran.

>
> > o The rollback code is quite sketchy. It doesn't really work so well, so
> > LFS has problems recovering from failures.
>
> Rolling *back* to the last snapshot is easy. It's the roll-forward
> code that's dodgy, isn't it?

In my experience the rollback code also has issues. I've seen it get
badly confused.

Eduardo

Paul_...@dell.com

Aug 18, 2016, 3:42:34 PM
to jose...@gmail.com, tech...@netbsd.org

> On Aug 18, 2016, at 1:58 PM, Jose Luis Rodriguez Garcia <jose...@gmail.com> wrote:
>
> On Thu, Aug 18, 2016 at 7:24 PM, Eduardo Horvath <e...@netbsd.org> wrote:
>>
>> LFS writes the metadata at the same time, in the same place as the data.
>> No synchronous writes necessary.
>
> As I understand it, LFS needs to do synchronous writes when metadata
> operations (directories) or fsync operations are involved. Instead of
> writing a full segment (1 MB by default), it writes a "small segment".
> That kills performance on RAID 5/6, because the write isn't a full
> stripe: you have to read all the disks to calculate the new parity, so
> 1 write on a RAID of x disks = x reads + 2 writes (data + parity).

That's not correct.

You're right that partial stripe writes are more expensive, but they are not *that* expensive. For RAID 5, the cost is two reads and two writes: read/write of the data disk, and read/write of the parity disk. For RAID 6 it is three reads and three writes (the same read/write of the data disk, plus read/write of both the P and Q disks).
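
As a concrete illustration (numbers invented for the example): on a
5-disk RAID 5 (4 data + 1 parity chunk per stripe), updating a single
chunk by read-modify-write costs

    read old data + read old parity                    = 2 reads
    new parity = old parity XOR old data XOR new data
    write new data + write new parity                  = 2 writes

whereas a full-stripe write needs no reads at all: the parity is
computed from the 4 new data chunks and all 5 chunks are written in one
pass, which is what writing whole LFS segments would achieve.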

paul

Jose Luis Rodriguez Garcia

Aug 18, 2016, 4:03:07 PM
to Paul_...@dell.com, tech...@netbsd.org
Thank you for your correction.

I have had this doubt for some time, but I was inclined toward the wrong answer.

An off-topic question: does a RAID-5 of 5 disks have the same
performance for partial stripe writes as a RAID-5 of 10 disks? If the
parity is "striped" across all disks, then a RAID-5 of 10 disks should
have better performance.

Bert Kiers

Aug 18, 2016, 6:34:55 PM
to Jose Luis Rodriguez Garcia, Eduardo Horvath, tech...@netbsd.org
On Thu, Aug 18, 2016 at 07:58:53PM +0200, Jose Luis Rodriguez Garcia wrote:

Hi,

> As I understand it, LFS needs to do synchronous writes when metadata
> operations (directories) or fsync operations are involved. Instead of
> writing a full segment (1 MB by default), it writes a "small segment".
> That kills performance on RAID 5/6, because the write isn't a full
> stripe: you have to read all the disks to calculate the new parity, so
> 1 write on a RAID of x disks = x reads + 2 writes (data + parity).
>
> The NVRAM memory solves this problem by acting as a buffer / write cache.

Sounds like WAFL https://en.wikipedia.org/wiki/Write_Anywhere_File_Layout

Even though that article doesn't mention NVRAM, all Netapp filers have
a (fixed) amount of NVRAM.

Our big FAS8080 has 16384 MB NVRAM, the small FAS2554 has 2 GB NVRAM.
Writes are very fast.

Grtnx,
--
B*E*R*T

Jose Luis Rodriguez Garcia

Aug 18, 2016, 6:40:49 PM
to David Holland, Eduardo Horvath, tech...@netbsd.org
On Thu, Aug 18, 2016 at 8:41 PM, David Holland <dholla...@netbsd.org> wrote:
..
> data to fill a segment. Rearranging it to assemble whole segments in
> nvram before writing them to disk is possible but would be a fairly
> big project.
>
> One could also integrate this with cleaning so you ~always write a
> whole segment by filling it up with data from cleaning if you don't
> have anything else to write, but that's a *big* project.

OK. I thought that it would be easier.

For the moment I will continue reading code and trying to send more
patches until I decide what to do.

Thor Lancelot Simon

Aug 19, 2016, 11:27:57 AM
to Eduardo Horvath, Jose Luis Rodriguez Garcia, tech...@netbsd.org
On Thu, Aug 18, 2016 at 06:23:32PM +0000, Eduardo Horvath wrote:
>
> I suppose adding code to LFS where it posts the sync write to copy
> it out to NVRAM would be relatively easy.

Perhaps, but I bet it'd be easier to implement a generic pseudodisk
device that used NVRAM (fast SSD, etc -- just another disk device really)
to buffer *all* writes to a given size and feed them out in that-size
chunks. Or to add support for that to RAIDframe.

For bonus points, do what the better "hardware" RAID cards do, and if
the inbound writes are already chunk-size or larger, bypass them around
the buffering (it implies extra copies and waits after all).
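
A rough userland sketch of that policy (not real NetBSD driver or
RAIDframe code; every name below is invented for the example):
sub-chunk writes are staged in an NVRAM-backed buffer and flushed in
chunk-size units, while writes that already arrive chunk-sized and
chunk-aligned bypass the staging copy.

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    #define CHUNK_SIZE (64u * 1024u)        /* assume one RAID stripe */

    struct nvcache {
            uint8_t  buf[CHUNK_SIZE];       /* stands in for the NVRAM */
            uint64_t base;                  /* device offset buf maps to */
            size_t   fill;                  /* bytes staged so far */
    };

    /* Stand-in for the real back end: write len bytes at off to the RAID. */
    void raid_write(uint64_t off, const void *p, size_t len);

    void
    nvcache_write(struct nvcache *nc, uint64_t off, const void *p, size_t len)
    {
            if (len >= CHUNK_SIZE && off % CHUNK_SIZE == 0) {
                    /* Already a full chunk: bypass the staging buffer. */
                    raid_write(off, p, len);
                    return;
            }

            /*
             * Stage the small write.  A real driver would track gaps and
             * overlaps; this sketch assumes writes arrive sequentially.
             */
            if (nc->fill == 0)
                    nc->base = off - off % CHUNK_SIZE;
            memcpy(nc->buf + (off - nc->base), p, len);
            nc->fill += len;

            if (nc->fill >= CHUNK_SIZE) {   /* flush one full chunk/stripe */
                    raid_write(nc->base, nc->buf, CHUNK_SIZE);
                    nc->fill = 0;
            }
    }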

That would help LFS and much more. And you can do it without having
to touch the LFS code.

However, there's still the issue that this could effectively reorder
reads around synchronous writes if you are not careful, particularly
with the boot-time replay. Ensuring that is right is probably the
trickiest part by far.

Thor

Jose Luis Rodriguez Garcia

Aug 19, 2016, 4:01:56 PM
to Thor Lancelot Simon, Eduardo Horvath, tech...@netbsd.org
On Fri, Aug 19, 2016 at 5:27 PM, Thor Lancelot Simon
>
> Perhaps, but I bet it'd be easier to implement a generic pseudodisk
> device that used NVRAM (fast SSD, etc -- just another disk device really)
> to buffer *all* writes to a given size and feed them out in that-size
> chunks. Or to add support for that to RAIDframe.
>
> For bonus points, do what the better "hardware" RAID cards do, and if
> the inbound writes are already chunk-size or larger, bypass them around
> the buffering (it implies extra copies and waits after all).
>
> That would help LFS and much more. And you can do it without having
> to touch the LFS code.

Wouldn't it be easier to add a layer that does these tasks inside
the LFS code? It has the disadvantage that it would be used only by
LFS, but it seems simpler, it is easier to bypass in cases where a
full segment write is done, and it is easier to add future optimizations.

It would be very useful for LFS if the physical segments were usually
full. Physical segments contain one or more segments.

Can someone confirm that LFS fills the physical segments in most
cases? That is what I understand from reading the Margo Seltzer paper.

It is more or less the thing that I had in mind.

> However, there's still the issue that this could effectively reorder
> reads around synchronous writes if you are not careful, particularly
> with the boot-time replay. Ensuring that is right is probably the
> trickiest part by far.

Could you elaborate or give an example?

If the driver/layer is a write cache for the disk, the boot-time replay
must go through the driver/layer and use the blocks in the NVRAM if
they exist; I don't see where reordering of reads comes in.

Jose Luis

Jose Luis Rodriguez Garcia

Aug 19, 2016, 9:21:04 PM
to Thor Lancelot Simon, Eduardo Horvath, tech...@netbsd.org
On Fri, Aug 19, 2016 at 5:27 PM, Thor Lancelot Simon <t...@panix.com> wrote:
> On Thu, Aug 18, 2016 at 06:23:32PM +0000, Eduardo Horvath wrote:
> chunks. Or to add support for that to RAIDframe.
..............
> That would help LFS and much more. And you can do it without having
> to touch the LFS code.
>
I have been thinking about this, and I think that this is the best
option. Although I would prefer to integrate it with LFS, as I said in
my previous mail, adding it to RAIDframe means it can be used/tested by
more people and it is possible that more developers/testers get
involved. Integrating it inside LFS will surely be a one-man project,
and it is very possible that it would never be finished.
Another bonus of integrating it with RAIDframe is that it can resolve
the RAID write hole problem:
http://www.raid-recovery-guide.com/raid5-write-hole.aspx
I don't know if NetBSD resolves the write hole problem (resolving it
has a performance penalty).
> However, there's still the issue that this could effectively reorder
> reads around synchronous writes if you are not careful, particularly
> with the boot-time replay. Ensuring that is right is probably the
> trickiest part by far.

If you can elaborate on/explain this, that would be great. If
RAIDframe+NVRAM provides a coherent RAID with the committed writes, I
don't see possible problems with the LFS boot-time replay.

Thor Lancelot Simon

Aug 21, 2016, 10:20:25 AM
to Jose Luis Rodriguez Garcia, Eduardo Horvath, tech...@netbsd.org
On Fri, Aug 19, 2016 at 10:01:43PM +0200, Jose Luis Rodriguez Garcia wrote:
> On Fri, Aug 19, 2016 at 5:27 PM, Thor Lancelot Simon
> >
> > Perhaps, but I bet it'd be easier to implement a generic pseudodisk
> > device that used NVRAM (fast SSD, etc -- just another disk device really)
> > to buffer *all* writes to a given size and feed them out in that-size
> > chunks. Or to add support for that to RAIDframe.
> >
> > For bonus points, do what the better "hardware" RAID cards do, and if
> > the inbound writes are already chunk-size or larger, bypass them around
> > the buffering (it implies extra copies and waits after all).
> >
> > That would help LFS and much more. And you can do it without having
> > to touch the LFS code.
>
> Wouldn't it be easier to add a layer that does these tasks inside
> the LFS code? It has the disadvantage that it would be used only by

I am guessing not. The LFS code is very large and complex -- much more
so than it needs to be. It is many times the size of the original Sprite
LFS code, which, frankly, worked better in almost all ways. It represents
(to me) a failed experiment at code and datastructure sharing with FFS (it
is also worse, and larger, because Sprite's buffer cache and driver APIs
were simpler than ours and better suited to LFS' needs).

It is so large and so complex that truly screamingly funny bugs, like
writing the blocks of a segment out in backwards order, went undetected
for long periods of time!

It might be possible to build something like this inside RAIDframe or LVM
but I think it would share little code with any other component of those
subsystems. I would suggest building it as a standalone driver which
takes a "data disk" and "cache disk" below and provides a "cached disk"
above. I actually think relatively little code should be required, and
avoiding interaction with other existing filesystems or pseudodisks should
keep it quite a bit simpler and cleaner.

Thor

Greg Oster

Aug 22, 2016, 11:12:44 AM
to Jose Luis Rodriguez Garcia, Thor Lancelot Simon, Eduardo Horvath, tech...@netbsd.org
On Sat, 20 Aug 2016 03:20:51 +0200
Jose Luis Rodriguez Garcia <jose...@gmail.com> wrote:

> On Fri, Aug 19, 2016 at 5:27 PM, Thor Lancelot Simon <t...@panix.com>
> wrote:
> > On Thu, Aug 18, 2016 at 06:23:32PM +0000, Eduardo Horvath wrote:
> > chunks. Or to add support for that to RAIDframe.
> ...............
> > That would help LFS and much more. And you can do it without having
> > to touch the LFS code.
> >
> I have been thinking about this, and I think that this is the best
> option. Although I would prefer to integrate it with LFS, as I said in
> my previous mail, adding it to RAIDframe means it can be used/tested by
> more people and it is possible that more developers/testers get
> involved. Integrating it inside LFS will surely be a one-man project,
> and it is very possible that it would never be finished.
> Another bonus of integrating it with RAIDframe is that it can resolve
> the RAID write hole problem:
> http://www.raid-recovery-guide.com/raid5-write-hole.aspx
> I don't know if NetBSD resolves the write hole problem (resolving it
> has a performance penalty).

RAIDframe maintains a 'Parity status:', which indicates whether or not
all the parity is up-to-date. Jed Davis did the GSoC work to add the
'parity map' stuff which significantly reduces the amount of effort
needed to ensure the parity is up-to-date after a crash. (Basically
RAIDframe checks (and corrects) any parity blocks in any modified
regions of the RAID set.)

Later...

Greg Oster

Greg Oster

Aug 22, 2016, 12:39:58 PM
to Thor Lancelot Simon, Jose Luis Rodriguez Garcia, Eduardo Horvath, tech...@netbsd.org
Building this as a layer that allows arbitrary devices as either the
'main store' or the 'cache' would work well, and allow for all sorts of
flexibility. What I don't know is how you'd glue that in to be a
device usable for /. The RAIDframe code in that regard is already a
nightmare!

Perhaps something along the lines of the dk(4) driver, where one could
either use it as a stand-alone device, or hook into it to use the
caching features... (e.g. 'register' the cache when raid0 is configured,
and then use/update the cache on reads/writes/etc to raid0)

Obviously this needs to be fleshed out a significant amount...

Later...

Greg Oster

David Holland

Aug 27, 2016, 10:18:59 PM
to Eduardo Horvath, David Holland, Jose Luis Rodriguez Garcia, tech...@netbsd.org
On Thu, Aug 18, 2016 at 07:00:02PM +0000, Eduardo Horvath wrote:
> > > And you should be able to roll back the
> > > filesystem to snapshots of any earlier synchronization points.
> >
> > In LFS there are only two snapshots and in practice often one of
> > them's not valid (because it was halfway through being taken when the
> > machine went down) so rolling further back isn't that feasible.
>
> I don't remember seeing any code that overwrites old snapshots, so most of
> them are still on disk. It's just a question of finding them, which is
> where the first and last superblock come into play.

Ok, partly I was mixing things up with the Sprite LFS, which has two
largish checkpoint areas it alternates between.

In BSD LFS the checkpoint space is small and in the superblock, but as
noted it still only uses two of them to checkpoint into. So I'm not
sure what you mean by old snapshots.

> > > o Although there are multiple super blocks scattered across the
> > > disk just like FFS, LFS only uses the first and last one. If
> > > both of those are corrupt, the filesystem image cannot be
> > > recovered. LFS should be enhanced to cycle through all the
> > > different super blocks for enhanced robustness.
> >
> > This should be left to fsck, like it is in ffs. I don't remember if
> > fsck_lfs supports recovering from an alternate superblock, but it
> > isn't going to be that hard.
>
> The LFS superblock contains a pointer to the end of the log. Since LFS
> only ever updates that pointer on the first and last superblock, if you try
> to use any other superblock to recover the filesystem you essentially roll
> it back to just after newfs_lfs ran.

Only if you robotically march off the cliff. Like in FFS, any
superblock tells you critical things about how the FS is laid out.
That tells you where the segments are, and that lets you identify the
segment headers.

Once you have that you can reassemble the rest. The segment headers
contain enough information to rebuild the checkpoint data. The
critical thing is to be able to detect and ignore a segment that was
only half written, but the design provides for that. (And also to not
accidentally process a stale segment header left on disk after a
newfs, but perseant fixed that.)

I don't know if the fsck_lfs we have can do this, though.

Jose Luis Rodriguez Garcia

Sep 3, 2016, 1:06:42 PM
to Greg Oster, Thor Lancelot Simon, Eduardo Horvath, tech...@netbsd.org
On Mon, Aug 22, 2016 at 5:12 PM, Greg Oster <os...@netbsd.org> wrote:
..
> RAIDframe maintains a 'Parity status:', which indicates whether or not
> all the parity is up-to-date. Jed Davis did the GSoC work to add the
> 'parity map' stuff which significantly reduces the amount of effort
> needed to ensure the parity is up-to-date after a crash. (Basically
> RAIDframe checks (and corrects) any parity blocks in any modified
> regions of the RAID set.)
>
As I understand it, parity + parity status only resolves, in an
efficient way, the RAID 1 write hole problem / rebuilding the parity,
although it can have a small penalty in performance, or in the time to
rebuild the parity, on busy storage, which usually means critical
servers.

For RAID-5/6 the write hole problem continues (it is only resolved if
it only affects the parity).

Jose Luis Rodriguez Garcia

Sep 4, 2016, 9:36:43 AM
to Greg Oster, Thor Lancelot Simon, Eduardo Horvath, tech...@netbsd.org
On Sat, Sep 3, 2016 at 7:06 PM, Jose Luis Rodriguez Garcia
<jose...@gmail.com> wrote:
>
> For RAID-5/6 the write hole problem continues (it is only resolved if
> it only affects the parity).

Sorry for the noise. Thinking about this, the write hole problem
happens only with the parity, so there isn't a write hole problem with
the scheme that RAIDframe follows.