
What is the best layer/device for a write-back cache based on NVRAM?


Jose Luis Rodriguez Garcia

Sep 9, 2016, 5:10:05 PM
to tech...@netbsd.org
This is a continuation of the thread "Is factible to implement full
writes of stripes to raid using NVRAM memory in LFS":
http://mail-index.netbsd.org/tech-kern/2016/08/18/msg020982.html

I want to discuss in what layer a write-back cache should be located. It
will usually be used for RAID configurations as a general-purpose
device: any type of filesystem, or raw.

Before discussing the different options, I want to present the
benefits that I think a write-back cache must provide, so that we can
check whether they can be supported by the different options.

1- There is no need to use a parity map for RAID 1/10/5/6. Usually
the impact is small, but it can be noticeable on busy servers.
 a) There is no parity to rebuild. The parity is always up to date.
Less downtime in case of an OS crash / power failure / hardware failure.
 b) Better performance for RAID 1/5/6. It isn't necessary to update
the parity map, because it doesn't exist.

2- For scattered writes contained in the same slice, it reduces the
number of writes. With RAID 5/6 there is an advantage: the parity
is written only once for several writes to the same slice, instead
of once for every write to the slice.
3- It allows consolidating several writes that take the full length
of the stripe into one write, without reading the parity. This can be
the case for log-structured file systems such as LFS, and allows using
RAID 5/6 with performance similar to RAID 0 (see the worked example
after this list).
4- Faster synchronous writes.
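
To make points 2 and 3 concrete, here is a back-of-the-envelope count of
disk I/Os per write (the numbers are only illustrative, for an assumed
4+1 RAID 5; they are not measurements from RAIDframe):

#include <stdio.h>

int
main(void)
{
	int data_disks = 4;	/* assumed 4+1 RAID 5 set */

	/* Small write: read old data + old parity, write new data + new parity. */
	int small_write_ios = 4;

	/* Full-stripe write: write all data columns plus parity, no reads. */
	int full_stripe_ios = data_disks + 1;

	printf("one small write:   %d I/Os\n", small_write_ios);
	printf("full stripe write: %d I/Os for %d units (%.2f per unit)\n",
	    full_stripe_ios, data_disks, (double)full_stripe_ios / data_disks);
	return 0;
}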

The proposed layer must support:

A- It must be able to obtain the RAID configuration of the RAID device
backing the write-back cache. If it is RAID 0/1, it will cache
portions of the size of the interleave. If it is RAID 5/6, it will
cache the size of a full slice (see the sketch after this list).

B- It can use the buffer cache to avoid read/write cycles, and do
only writes if the data to be read is already in memory.

C- Several devices can share the same write-back cache device ->
optimal and easy to configure. There is no need to hard-partition
an NVRAM device into smaller devices, with one partition over-used and
another under-used.

D- For filesystems such as LFS, it would be useful to do the
following optimization: when a slice is complete in the buffer, write it
out promptly, because it won't be written to any more.

E- It can be useful to use elevator algorithms for the writes from
the buffer cache to the RAID.
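
As a sketch of what I mean in A (the function and parameter names here
are invented for illustration; this is not an existing RAIDframe
interface):

#include <stdio.h>

/* Hypothetical helper: derive the cache segment size from the RAID geometry. */
static unsigned
cache_segment_bytes(int raid_level, unsigned bytes_per_stripe_unit, int data_cols)
{
	switch (raid_level) {
	case 0:
	case 1:
		/* Cache interleave-sized portions. */
		return bytes_per_stripe_unit;
	case 5:
	case 6:
		/* Cache a full slice (the data portion of a stripe). */
		return bytes_per_stripe_unit * data_cols;
	default:
		return bytes_per_stripe_unit;
	}
}

int
main(void)
{
	/* Example: 4+1 RAID 5 with 16 KB stripe units -> 64 KB cache segments. */
	printf("%u bytes\n", cache_segment_bytes(5, 16 * 1024, 4));
	return 0;
}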



These are the three options proposed by Thor. I would like to know
which option you think is best:

1- Implement the write-back cache in a generic pseudo-disk. This
pseudo-disk is attached on top of a raid/disk/etc. This is also the
option suggested by Greg.

It seems to be the option most recommended in the previous thread.

2- Add this to RAIDframe.

Is it easier to implement/integrate with RAIDframe? The RAID
configuration is contained in the same driver.

It can be easier for a sysadmin to configure: fewer
devices/commands, and not prone to corruption errors: there isn't both
a device with the write-back cache and the same device without it.
For non-RAID devices it can be used as a RAID 0 of one disk.

3- LVM. I don't see any special advantage in this option.


I want to leave for another thread the question of which devices must be
supported: NVRAM/NVMe/disk/etc.

As notes:
- mdadm has the possibility of using a disk (flash, for example) as a
journal for RAID devices. It would be used instead of parity maps. It
has been integrated into mdadm: https://lwn.net/Articles/665299/

- I don't think it is possible to use the write-back cache for booting
the OS in an easy way.

Thor Lancelot Simon

Sep 10, 2016, 10:19:11 AM
to Jose Luis Rodriguez Garcia, tech...@netbsd.org
On Fri, Sep 09, 2016 at 11:09:49PM +0200, Jose Luis Rodriguez Garcia wrote:
>
> The proposed layer must support:
>
> A- It must be able to obtain the RAID configuration of the RAID device
> backing the write-back cache. If it is RAID 0/1, it will cache
> portions of the size of the interleave. If it is RAID 5/6, it will
> cache the size of a full slice.

Don't bother with this for now. Let the user configure the cache segment
manually. Passing the information you describe around between layers
is hard and is a problem better left for later.

> B- It can use the buffer cache to avoid read/write cycles, and do
> only writes if the data to be read is already in memory.

I'd stay more focused on avoiding _copies_. Any sane implementation
must satisfy reads from the cache layer if the data are cached. I
would avoid confusing this layer with the page or buffer caches.

> C- Several devices can share the same write-back cache device ->
> optimal and easy to configure. There is no need to hard-partition
> an NVRAM device into smaller devices, with one partition over-used and
> another under-used.

I think this is again too large a goal. If you are worried about
wear levelling, don't -- any modern backing device will do that for
you.

> D- For filesystems such as LFS, it would be useful to do the
> following optimization: when a slice is complete in the buffer, write it
> out promptly, because it won't be written to any more.

This is pretty much the same as "A", is it not? Let the user tell you
the cache segment size (chunk size, slice size, whatever you want to
call it) and be done, for now.

> E- It can be useful to use elevator algorithms for the writes from
> the buffer cache to the RAID.

I definitely don't agree. If you do this, you'll end up with request
sorting at at least three layers! The only reason to sort requests
would be if you were going to coalesce them, and I would again urge
that you do that later.

--
Thor Lancelot Simon t...@panix.com

"The dirtiest word in art is the C-word. I can't even say 'craft'
without feeling dirty." -Chuck Close

David Holland

Sep 10, 2016, 1:41:56 PM
to Jose Luis Rodriguez Garcia, tech...@netbsd.org
On Fri, Sep 09, 2016 at 11:09:49PM +0200, Jose Luis Rodriguez Garcia wrote:
> This is a continuation of the thread "Is factible to implement full
> writes of stripes to raid using NVRAM memory in LFS":
> http://mail-index.netbsd.org/tech-kern/2016/08/18/msg020982.html
>
> I want to discuss in what layer a write-back cache should be located. It
> will usually be used for RAID configurations as a general-purpose
> device: any type of filesystem, or raw.

It sounds like you've already decided what layer it appears in, if
it's going to be used as a block device. I guess your question is
whether it should be integrated into raidframe or sit on top of it?
My recommendation would be to make a separate entity that's the cache,
and then add a small amount of code to raidframe to call sideways into
it when needed. Then you can also add similar code to non-raidframe ld
or wd/sd devices and you don't end up exposing raidframe internals.

Some other thoughts:

- What you're talking about seems like it only really makes sense if
you have "fast" NVRAM, like PCM (when/if it ever appears in the market
for real) or battery-backed DRAM. Or maybe if you're using flash in
front of a small number of spinny-disks. Otherwise the cost of writing
to the cache, then reading from the cache and writing back to the
underlying device, is likely to outweigh whatever you save by avoiding
excess parity I/O.

- If you want your NVRAM cache to be recoverable (which sounds like
it's the point) you need to write enough logging data to it to be able
to recover it. This effectively means doing two writes for every write
you cache: one with the data and one to remember where the data is.
You can conceivably batch the metadata writes, but batching those
suffers from the same issues (small write chunks, frequent syncs,
etc.) that you're trying to avoid in the RAID so you can't expect it
to work very well. If both the cache and the RAID are flash, three
extra I/Os for every block means you have to save at least three I/Os
in the RAID for every block; that is not likely to be the case. Maybe
the transfer from the cache to the RAID is invisible and you only need
to save two, but that still doesn't seem that likely.
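
For concreteness, the kind of per-write record I mean might look roughly
like this (the layout is invented for illustration, not taken from any
existing driver):

#include <stdint.h>

struct nvcache_log_rec {
	uint64_t seq;		/* monotonic sequence number for replay ordering */
	uint64_t dev_id;	/* which backing device the write belongs to */
	uint64_t blkno;		/* destination block number on that device */
	uint32_t nblks;		/* number of blocks in this cached write */
	uint32_t csum;		/* checksum over the header and the data */
	/* followed in NVRAM by nblks blocks of data */
};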

- The chief benefit of using flash as a frontend cache for spinny
disks turns out to be that flash is much larger than main memory. This
works fine if the cache is treated as expendable in crashes (and also
can be used safely with consumer-grade SSDs that munge themselves when
the power goes off)... making the cache persistent is expensive and
only helps with write-heavy workloads that otherwise saturate the
write bandwidth or that sync a lot. I guess the latter is what you're
after... but it will still only help in front of spinny disks.

- If you have "fast" NVRAM it won't be particularly large. (Maybe
someday PCM or memristors or something will be substantially cheaper
than DRAM and only marginally slower, but that doesn't seem too
likely, and it certainly isn't the case today or likely to happen
anytime soon.) This means that the volume of writes it can absorb will
be fairly limited. However, it'll still probably be at least somewhat
useful for workloads that sync a lot. The catch is that PCM and
memristors and whatnot don't actually exist yet in useful form, and
while battery-backed DRAM does in principle, such hardware isn't
readily available so the virtues of supporting it are limited.

- It might also make sense for LFS to assemble segments in "fast"
NVRAM, although the cost of implementing this will be pretty high. It
should be able to make use of the same entity I described above -- as
this should eliminate the need to have another one underneath it, it
won't be redundant that way.

- If anyone ever gets around to merging the Harvard journaling FFS,
which supports external journals, it would be straightforward to put
that journal on an NVRAM device, "fast" or otherwise. WAPBL doesn't
really support this though (AFAIK) and doing it won't solve WAPBL's
other problems (will probably exacerbate them) so isn't all that
worthwhile.

- I don't think there's very much to be gained by trying to integrate
nvram caching with the buffer cache, at least right now. There are
several reasons for this: (1) the gains vs. having it as a separate
caching layer aren't that great; (2) the buffer cache interface has no
notion of persistence, so while it might work for a large
non-persistent flash cache it won't do anything for the problems
you're worried about without a fairly substantial redesign; (3) the
buffer cache code is a godawful mess that needs multiple passes with
torches and pitchforks before trying to extend it; (4) right now the
UBC logic is not integrated with the buffer cache interface so one
would also need to muck with UVM in some of its most delicate and
incomprehensible parts (e.g. genfs_putpages)... and (5) none of this
is prepared to cope with buffers that can't be mapped into system
memory.

- Note that because zfs has its own notions of disk arrays, and its
own notions of caching, you might be able to add something like this
to zfs more easily, at the cost of it working only with zfs and maybe
interacting badly with the rest of the system.

> 1- There is no need to use a parity map for RAID 1/10/5/6. Usually
> the impact is small, but it can be noticeable on busy servers.
> a) There is no parity to rebuild. The parity is always up to date.
> Less downtime in case of an OS crash / power failure / hardware failure.
> b) Better performance for RAID 1/5/6. It isn't necessary to update
> the parity map, because it doesn't exist.

Remember that you still need to write metadata equivalent to the
parity map to the NVRAM, because you have to be able to recover the
NVRAM and figure out what's on it after crashing.

> 2- For scattered writes contained in the same slice, it reduces the
> number of writes. With RAID 5/6 there is an advantage: the parity
> is written only once for several writes to the same slice, instead
> of once for every write to the slice.

Also, is the cache itself a RAID? If so, you have the same problem
recursively; if not, you lose the redundant mirroring until the data
is transferred out. Maybe the cache can always be a RAID 1 though.

> A- It must be able to obtain the RAID configuration of the RAID device
> backing the write-back cache. If it is RAID 0/1, it will cache
> portions of the size of the interleave. If it is RAID 5/6, it will
> cache the size of a full slice.

As I said above, I think the way this ought to work is by the raid code
calling into the nvram cache, not the other way around.

> B- It can use the buffer cache to avoid read/write cycles, and do
> only writes if the data to be read is already in memory.

I don't think that makes sense.

> C- Several devices can share the same write-back cache device ->
> optimal and easy to configure. There is no need to hard-partition
> an NVRAM device into smaller devices, with one partition over-used and
> another under-used.

That adds a heck of a lot of complexity. Remember you need to be able
to recover the NVRAM after crashing.

> These are the three options proposed by Thor. I would like to know
> which option you think is best:
>
> 1- Implement the write-back cache in a generic pseudo-disk. This
> pseudo-disk is attached on top of a raid/disk/etc. This is also the
> option suggested by Greg.
>
> It seems to be the option most recommended in the previous thread.
>
> 2- Add this to RAIDframe.
>
> Is it easier to implement/integrate with RAIDframe? The RAID
> configuration is contained in the same driver.
>
> It can be easier for a sysadmin to configure: fewer
> devices/commands, and not prone to corruption errors: there isn't both
> a device with the write-back cache and the same device without it.
> For non-RAID devices it can be used as a RAID 0 of one disk.
>
> 3- LVM. I don't see any special advantage in this option.

See above. I think what I suggested is a mixture of (1) and (2) and
preferable to either.

--
David A. Holland
dhol...@netbsd.org

Jose Luis Rodriguez Garcia

Sep 11, 2016, 5:57:39 AM
to Thor Lancelot Simon, tech...@netbsd.org
On Sat, Sep 10, 2016 at 4:18 PM, Thor Lancelot Simon <t...@panix.com> wrote:
> On Fri, Sep 09, 2016 at 11:09:49PM +0200, Jose Luis Rodriguez Garcia wrote:
>> B- It can use the buffer cache to avoid read/write cycles, and do
>> only writes if the data to be read is already in memory.
>
> I'd stay more focused on avoiding _copies_. Any sane implementation
> must satisfy reads from the cache layer if the data are cached. I
> would avoid confusing this layer with the page or buffer caches.

What are these _copies_? I don't understand.

>> E- It can be useful to use elevator algorithms for the writes from
>> the buffer cache to the RAID.
>
> I definitely don't agree. If you do this, you'll end up with request
> sorting at at least three layers! The only reason to sort requests
> would be if you were going to coalesce them, and I would again urge
> that you do that later.

I was thinking of doing synchronous writes from the cache. But it seems
better to do asynchronous writes instead, and let RAIDframe/another layer
do the sorting. It is inefficient to do the sorting several times.

Two questions:
1- Do RAIDframe/ld/other drivers allow asynchronous writes?
2- Which drivers/layers do the actual sorting/elevator algorithm?

Yes, I was thinking of coalescing them. I don't think that is
incompatible with another layer performing the sorting.

Edgar Fuß

Sep 14, 2016, 7:54:50 AM
to tech...@netbsd.org
I'm using a 12TB RAIDframe Level 5 RAID (4+1 discs) in production.
There are 150 people's home directories and mail on FFS file systems on it.

> 1- There is no need to use a parity map for RAID 1/10/5/6. Usually
> the impact is small, but it can be noticeable on busy servers.
I don't notice it.

> 2- For scattered writes contained in the same slice, it reduces the
> number of writes. With RAID 5/6 there is an advantage: the parity
> is written only once for several writes to the same slice, instead
> of once for every write to the slice.
> 3- It allows consolidating several writes that take the full length
> of the stripe into one write, without reading the parity. This can be
> the case for log-structured file systems such as LFS, and allows using
> RAID 5/6 with performance similar to RAID 0.
You ought to adjust your slice size and FS block size then, I'd suppose.

I specifically don't get the LFS point. LFS writes in segments, which are
rather large. A segment should match a slice (or a number of them).
I would suppose LFS to perform great on a RAIDframe. Isn't Manuel Bouyer
using this in production?

> 4- Faster synchronous writes.
Y E S.
This is the only point I fully agree on. We've had severe problems with
brain-dead software (Firefox, Dropbox) performing tons of synchronous 4K
writes (on a bs=16K FFS) which nearly killed us until I wrote Dotcache
(http://www.math.uni-bonn.de/people/ef/dotcache) and we set XDG_CACHE_HOME
to point to local storage.

With WAPBL, there's also the journal (plus sync writes flushing it).

Manuel Bouyer

Sep 14, 2016, 9:16:17 AM
to tech...@netbsd.org
On Wed, Sep 14, 2016 at 01:54:34PM +0200, Edgar Fuß wrote:
> [...]
> I would suppose LFS to perform great on a RAIDframe. Isn't Manuel Bouyer
> using this in production?

No, I played with LFS at some point but I never used it in production.

--
Manuel Bouyer <bou...@antioche.eu.org>
     NetBSD: 26 years of experience will always make the difference
--

Eduardo Horvath

Sep 14, 2016, 11:16:22 AM
to Edgar Fuß, tech...@netbsd.org
On Wed, 14 Sep 2016, Edgar Fuß wrote:

> > 2- For scattered writes contained in the same slice, it reduces the
> > number of writes. With RAID 5/6 there is an advantage: the parity
> > is written only once for several writes to the same slice, instead
> > of once for every write to the slice.
> > 3- It allows consolidating several writes that take the full length
> > of the stripe into one write, without reading the parity. This can be
> > the case for log-structured file systems such as LFS, and allows using
> > RAID 5/6 with performance similar to RAID 0.
> You ought to adjust your slice size and FS block size then, I'd suppose.
>
> I specifically don't get the LFS point. LFS writes in segments, which are
> rather large. A segment should match a slice (or a number of them).
> I would suppose LFS to perform great on a RAIDframe. Isn't Manuel Bouyer
> using this in production?
>
> > 4- Faster synchronous writes.
> Y E S.
> This is the only point I fully agree on. We've had severe problems with
> brain-dead software (Firefox, Dropbox) performing tons of synchronous 4K
> writes (on a bs=16K FFS) which nearly killed us until I wrote Dotcache
> (http://www.math.uni-bonn.de/people/ef/dotcache) and we set XDG_CACHE_HOME
> to point to local storage.

Hm... Maybe what you need to do is make the LFS segment the same size as
the RAID stripe, then mount LFS async so it only ever writes entire
segments....
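
Numerically, with assumed numbers (only an illustration, not a tuning
recommendation), that would mean something like:

#include <stdio.h>

int
main(void)
{
	unsigned stripe_unit = 16 * 1024;	/* assumed interleave: 16 KB */
	unsigned data_disks = 4;		/* assumed 4+1 RAID 5 */
	unsigned stripe = stripe_unit * data_disks;
	unsigned segment = 1024 * 1024;		/* hypothetical LFS segment size */

	printf("data per stripe: %u KB, segment covers %u whole stripes\n",
	    stripe / 1024, segment / stripe);
	return 0;
}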

Eduardo

Jose Luis Rodriguez Garcia

Sep 14, 2016, 3:06:39 PM
to e...@math.uni-bonn.de, tech...@netbsd.org
>> 1- There is no need to use a parity map for RAID 1/10/5/6. Usually
>> the impact is small, but it can be noticeable on busy servers.
>I don't notice it.
When there is a crash, is the time to rebuild the RAID < 1 min?

..
>rather large. A segment should match a slice (or a number of them).
>I would suppose LFS to perform great on a RAIDframe. Isn't Manuel Bouyer
>using this in production?
This is the idea, but when there is an fsync, it must be written to
disk. Therefore there are small partial segments inside one "physical
segment".

Also LFS is still far from being stable.

>> 4- Faster synchronous writes.
>Y E S.
>This is the only point I fully agree on. We've had severe problems with
>brain-dead software (Firefox, Dropbox) performing tons of synchronous 4K
>writes (on a bs=16K FFS) which nearly killed us until I wrote Dotcache
>(http://www.math.uni-bonn.de/people/ef/dotcache) and we set XDG_CACHE_HOME
>to point to local storage.

This one is the last point in my list, but it is the obvious
advantage of any write cache.

Hauke Fath

Sep 14, 2016, 3:29:31 PM
to Edgar Fuß, tech...@netbsd.org
On Wed, 14 Sep 2016 13:54:34 +0200, Edgar Fuß wrote:
>> 4- Faster synchronous writes.
> Y E S.
> This is the only point I fully agree on. We've had severe problems with
> brain-dead software (Firefox, Dropbox) performing tons of synchronous 4K
> writes (on a bs=16K FFS) which nearly killed us until I wrote Dotcache
> (http://www.math.uni-bonn.de/people/ef/dotcache) and we set XDG_CACHE_HOME
> to point to local storage.

Nice.

We run unburden <https://github.com/xtaran/unburden-home-dir> at work
on Debian desktops, cutting login-to-desktop times for NFS homes from 1
min to a few seconds...

Since your average Linux app programmer will be working from a fast local
disk, it will take a while before session-local files are treated
consistently, I guess.

hauke

--
Hauke Fath <ha...@Espresso.Rhein-Neckar.DE>
Ernst-Ludwig-Straße 15
64625 Bensheim
Germany

Jose Luis Rodriguez Garcia

Sep 14, 2016, 5:09:39 PM
to David Holland, tech...@netbsd.org
Holland, thank you for your answers.

On Sat, Sep 10, 2016 at 7:41 PM, David Holland <dholla...@netbsd.org> wrote:
> On Fri, Sep 09, 2016 at 11:09:49PM +0200, Jose Luis Rodriguez Garcia wrote:
>
> It sounds like you've already decided what layer it appears in, if
> it's going to be used as a block device.
Is there another option: a character device for disks? Sorry, I don't understand.

>I guess your question is
>whether it should be integrated into raidframe or sit on top of it?
>My recommendation would be to make a separate entity that's the cache,
>and then add a small amount of code to raidframe to call sideways into
>it when needed. Then you can also add similar code to non-raidframe ld
>or wd/sd devices and you don't end up exposing raidframe internals.

I was thinking of sitting on top of RAIDframe/others, not under them.
I think it is easier to do optimizations, such as coalescing
writes: for example, two writes of 512 bytes in the same
chunk/interleave, when many disks have a block size of 4 Kbytes.
If it isn't on top, I would have to create a hook for every write/read, in
every disk driver.

Also I would like to do other optimizations for RAIDframe; see the x+1
example below in this mail, for the case of RAID 5/6, to avoid some reads.

I think that if it is on top, it will be less intrusive in other drivers.

What are your reasons for thinking it is better that the cache sits under
the disk devices?

Also, I have decided it should be a separate driver, not integrated into
RAIDframe, because that is the preferred solution on tech-kern.

> Some other thoughts:
>
> - What you're talking about seems like it only really makes sense if
> you have "fast" NVRAM, like PCM (when/if it ever appears in the market
> for real) or battery-backed DRAM. Or maybe if you're using flash in
> front of a small number of spinny-disks. Otherwise the cost of writing
> to the cache, then reading from the cache and writing back to the
> underlying device, is likely to outweigh whatever you save by avoiding
> excess parity I/O.
The list of benefits wasn't ordered by importance. This point alone
doesn't justify creating this driver. The motivation for me is LFS,
reducing write latency (this can be the major benefit for most people)
and reducing the number of I/Os to the RAID 5/6.

I think it is possible to achieve these goals with spinny disks. For
other types of disks, it will have to be tested whether there is an
advantage.

>
> - If you want your NVRAM cache to be recoverable (which sounds like
> it's the point) you need to write enough logging data to it to be able
> to recover it. This effectively means doing two writes for every write
> you cache: one with the data and one to remember where the data is.
> You can conceivably batch the metadata writes, but batching those
> suffers from the same issues (small write chunks, frequent syncs,
> etc.) that you're trying to avoid in the RAID so you can't expect it
> to work very well. If both the cache and the RAID are flash, three
> extra I/Os for every block means you have to save at least three I/Os
> in the RAID for every block; that is not likely to be the case. Maybe
> the transfer from the cache to the RAID is invisible and you only need
> to save two, but that still doesn't seem that likely.
>
My principal motivation is caching writes. I think that a cache of
between several megabytes and a few gigabytes will be OK.

Because there isn't much memory used in the NVRAM (or whatever device is
used), the same content of the NVRAM can be kept in the RAM of the
server. Then the NVRAM would only be read to do a "recover" after a
crash. In normal use there are only writes to the NVRAM.

Then instead of three I/Os it is two I/Os. I think that two I/Os on a
fast PCIe device will be faster than one I/O to a spinny disk.
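
A crude latency comparison, with assumed (not measured) numbers, of what
I mean:

#include <stdio.h>

int
main(void)
{
	double nvram_write_us = 20.0;	/* assumed latency of one PCIe NVRAM write */
	double disk_write_us = 5000.0;	/* ~5 ms for one random write on a spinny disk */

	printf("2 NVRAM writes: %.0f us\n", 2.0 * nvram_write_us);
	printf("1 disk write:   %.0f us\n", disk_write_us);
	return 0;
}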

I was thinking of using PCIe cards: they can be NVRAM memory / nvme(4) ...
There are several types of devices, but I would like to discuss this in
another thread, once I have resolved my doubts in this thread.

> - The chief benefit of using flash as a frontend cache for spinny
> disks turns out to be that flash is much larger than main memory. This
> works fine if the cache is treated as expendable in crashes (and also
> can be used safely with consumer-grade SSDs that munge themselves when
> the power goes off)... making the cache persistent is expensive and
> only helps with write-heavy workloads that otherwise saturate the
> write bandwidth or that sync a lot. I guess the latter is what you're
> after... but it will still only help in front of spinny disks.
>
Yes, it is the main benefit. I would like NetBSD to be able to handle big
RAIDs of 12 disks in RAID 6. I know that RAID 6 isn't stable and there is
a GSoC proposal for testing this.


> - It might also make sense for LFS to assemble segments in "fast"
> NVRAM, although the cost of implementing this will be pretty high. It
> should be able to make use of the same entity I described above -- as
> this should eliminate the need to have another one underneath it, it
> won't be redundant that way.
As I understand it, LFS, after writing partial segments because of fsync,
in the end writes out the space of a full segment. The full slice of a
RAID 5/6 could be cached.
>
> - If anyone ever gets around to merging the Harvard journaling FFS,
> which supports external journals, it would be straightforward to put
> that journal on an NVRAM device, "fast" or otherwise. WAPBL doesn't
> really support this though (AFAIK) and doing it won't solve WAPBL's
> other problems (will probably exacerbate them) so isn't all that
> worthwhile.
I hadn't heard about it. Could you provide some links about it, out of
curiosity?
>
> - I don't think there's very much to be gained by trying to integrate
> nvram caching with the buffer cache, at least right now. There are
> several reasons for this: (1) the gains vs. having it as a separate
> caching layer aren't that great; (2) the buffer cache interface has no
> notion of persistence, so while it might work for a large
> non-persistent flash cache it won't do anything for the problems
> you're worried about without a fairly substantial redesign; (3) the
> buffer cache code is a godawful mess that needs multiple passes with
> torches and pitchforks before trying to extend it; (4) right now the
> UBC logic is not integrated with the buffer cache interface so one
> would also need to muck with UVM in some of its most delicate and
> incomprehensible parts (e.g. genfs_putpages)... and (5) none of this
> is prepared to cope with buffers that can't be mapped into system
> memory.
>
This wasn't in "my list" of main priorities for this driver. If this is
difficult to do, I can do a "read cache" for the blocks of the chunks
stored in the NVRAM. The drawback is that a chunk could already be in the
buffer cache before the write and be missed.

I repeat, this is for the case where, on RAID 5, the interleave size is 4
blocks: for a write of 512 bytes to block x+0 of a column, if I have
cached the reads of blocks x+1, x+2 and x+3, I will only have to do:
1 read of parity + 1 write of data + 1 write of parity, instead of
1 read of data (to read x+1, x+2, x+3) + 1 read of parity + 1 write of
data + 1 write of parity.
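
Tallying those counts (a 512-byte write into a 4-block interleave on
RAID 5, as in the example above):

#include <stdio.h>

int
main(void)
{
	/* x+1, x+2, x+3 already cached: skip the data read. */
	int with_cache = 1 /* read parity */ + 1 /* write data */ + 1 /* write parity */;

	/* Nothing cached: read the rest of the interleave first. */
	int without_cache = 1 /* read x+1..x+3 */ + with_cache;

	printf("with cached reads: %d I/Os, without: %d I/Os\n",
	    with_cache, without_cache);
	return 0;
}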


Anyway, I think that it will be for phase 2.


>
> > A- It must be able to obtain the RAID configuration of the RAID device
> > backing the write-back cache. If it is RAID 0/1, it will cache
> > portions of the size of the interleave. If it is RAID 5/6, it will
> > cache the size of a full slice.
>
> As I said above, I think the way this ought to work is by the raid code
> calling into the nvram cache, not the other way around.
>

> > B- It can use the buffer cache to avoid read/write cycles, and do
> > only writes if the data to be read is already in memory.
>
> I don't think that makes sense.
It is the case that I describe above for RAID 5/6, to avoid the reads of
x+1, x+2, x+3.

>
> > C- Several devices can share the same write-back cache device ->
> > optimal and easy to configure. There is no need to hard-partition
> > an NVRAM device into smaller devices, with one partition over-used and
> > another under-used.
>
> That adds a heck of a lot of complexity. Remember you need to be able
> to recover the NVRAM after crashing.
>
The schema of the buffers stored in NVRAM could be something like this:

struct buffer_descriptor {
	dev_t device;
	daddr_t address;
	bitmap_t cached_blocks;	/* indicates which blocks have cached writes */
};
The NVRAM always stores the full slice/interleave. The bitmap
indicates which blocks are valid.
For example, there are 1000 buffer_descriptors and 1000 buffers.

I don't see it as difficult to recover, even if the cache is shared
between several devices. When a device is attached to the cache, it does
the recovery as a first step, and recovers its cached buffers (see the
sketch below).

Is there some big complexity in a cache being shared between several
devices, in autoconf, etc.?
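
A rough sketch of the recovery step I have in mind (all names here are
invented, and I use a plain uint32_t in place of bitmap_t; this is not
an existing API):

#include <stdint.h>
#include <stdio.h>

#define NVCACHE_NDESC		1000	/* e.g. 1000 descriptors and 1000 buffers */
#define BLOCKS_PER_SLICE	32	/* assumed slice size, in blocks */

struct buffer_descriptor {
	uint64_t device;	/* backing device id */
	uint64_t address;	/* first block of the cached slice */
	uint32_t cached_blocks;	/* bitmap: which blocks hold cached writes */
};

static struct buffer_descriptor nvram_desc[NVCACHE_NDESC];

/* Stand-in for the real write path back to the RAID/disk. */
static void
write_block(uint64_t dev, uint64_t blkno)
{
	printf("replay dev %llu block %llu\n",
	    (unsigned long long)dev, (unsigned long long)blkno);
}

/* Recovery pass run when a device is attached to the cache. */
static void
nvcache_recover(uint64_t attached_dev)
{
	for (int i = 0; i < NVCACHE_NDESC; i++) {
		struct buffer_descriptor *bd = &nvram_desc[i];

		if (bd->device != attached_dev)
			continue;	/* cached for some other device */
		for (int b = 0; b < BLOCKS_PER_SLICE; b++)
			if (bd->cached_blocks & (1U << b))
				write_block(bd->device, bd->address + b);
	}
}

int
main(void)
{
	/* Pretend one slice of device 1 has blocks 0 and 2 dirty. */
	nvram_desc[0] = (struct buffer_descriptor){ 1, 1024, 0x5 };
	nvcache_recover(1);
	return 0;
}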

Thor Lancelot Simon

Sep 14, 2016, 7:55:07 PM
to Eduardo Horvath, Edgar Fuß, tech...@netbsd.org
You have essentially described exactly how Sprite LFS worked (I'll have to
dig out the source to see what it did about fsync()).

Edgar Fuß

Sep 15, 2016, 4:54:04 AM
to tech...@netbsd.org
JLRG> 1- There is no need to use a parity map for RAID 1/10/5/6. Usually
JLRG> the impact is small, but it can be noticeable on busy servers.
EF> I don't notice it.
JLRG> When there is a crash, is the time to rebuild the RAID < 1 min?
I said I didn't notice an impact. Of course I do notice a massive performance
gain when needing to re-build the parity.