
Plan: journalling fixes for WAPBL


Jaromír Doleček

Sep 21, 2016, 7:28:06 PM
to tech...@netbsd.org
Hi,

I've been poking around in the WAPBL sources and some of the email
threads, and have also read the doc/roadmaps comments, so I'm aware of
some of the sentiment.

I think it would still be useful to make WAPBL safe to enable by
default again in NetBSD. Neither lfs64 nor the Harvard journalling fs is
currently in tree, so it's unknown when either would be stable enough to
replace ffs as the default. Also, I think it is useful to keep some kind
of generic[*] journalling code around, perhaps for eventual use by
ext2fs or maybe xfs.

In either case, IMO it is also good to make some generic system
improvements that any journalling solution could use.

I see the following groups of useful changes. For a realistic -8
timeframe, IMO only group 1 really needs to be resolved to safely enable
wapbl journalling by default.

1. critical fixes for WAPBL
2. less critical fixes for WAPBL
3. performance improvements for WAPBL
4. disk subsystem and journalling-related improvements

1. Critical fixes for WAPBL
1.1 kern/47146 kernel panic when many files are unlinked
1.2 kern/50725 discard handling
1.3 kern/49175 degenerate truncate() case - too embarrassing to leave in

2. Less critical fixes for WAPBL
2.1 kern/45676 flush semantics

2.2 (no PR) make group descriptor updates part of the change transaction
The transaction that changes a cylinder group descriptor should also
contain the cg block write. Right now the cg blocks are written to disk
during the filesystem sync in a separate transaction, so it's quite
common that they do not survive a crash that happens before the sync.
Normally fsck fixes these easily using the inode metadata, but fsck is
skipped for journalled filesystems. IMO this can lead to incorrect block
allocation until fsck is actually run.

2.3 file data leaks on crashes
File data blocks are written asynchronously, so some of them can reach
the disk before the journal is committed; after a crash such a block can
end up in a different file. FFS has always had this problem, even with
softdep, albeit in a more limited form.

2.4 buffer blocks kept in memory until commit
WAPBL keeps buffer cache bufs in memory with the B_LOCKED flag until
commit, starving the buffer cache subsystem.

3. WAPBL performance fixes
3.1 checksum journal data for commit
Avoid one of the two DIOCCACHESYNCs by computing a checksum over the
journal data and storing it in the commit record; there is already a
field for it, so it's a matter of implementation. There may be a CPU
usage concern, though. crc32c is a good candidate hash; do we need to
offer alternatives?
This seems reasonably simple to implement; it just needs some hooks in
the journal write and journal replay logic.
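
A minimal sketch of the commit-side part, assuming a software crc32c and
a placeholder wc_checksum-style field in the commit record (the real
record layout and buffer walk would of course need to be checked):

#include <stdint.h>
#include <stddef.h>

/* Software CRC32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t
crc32c(uint32_t crc, const void *buf, size_t len)
{
        const uint8_t *p = buf;

        crc = ~crc;
        while (len--) {
                crc ^= *p++;
                for (int i = 0; i < 8; i++)
                        crc = (crc >> 1) ^ (0x82F63B78 & -(crc & 1));
        }
        return ~crc;
}

/* Hypothetical view of one journal block queued for this transaction. */
struct jblock {
        const void *data;
        size_t len;
};

/*
 * At commit time: checksum all journal blocks of the transaction and
 * store the result in the commit record instead of issuing the first
 * DIOCCACHESYNC.  Replay recomputes the same sum over the blocks the
 * commit record claims to cover and discards the transaction on a
 * mismatch, i.e. when the journal data never fully reached the platter.
 */
static uint32_t
journal_checksum(const struct jblock *blk, size_t nblocks)
{
        uint32_t sum = 0;

        for (size_t i = 0; i < nblocks; i++)
                sum = crc32c(sum, blk[i].data, blk[i].len);
        return sum;
}

(A table-driven or SSE4.2 crc32c would be the obvious answer to the CPU
concern; the bitwise loop above is just the shortest correct version.)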

3.2 use FUA (Force Unit Access) for the commit record write
This avoids the need to issue even the second DIOCCACHESYNC, as flushing
the whole disk cache is not really all that useful; I like the thread
over at:
http://yarchive.net/comp/linux/drive_caches.html
Slightly less controversially, this would allow the rest of the journal
records to be written asynchronously, leaving them to complete even
after the commit if so desired. It may be useful to make this behaviour
optional. I lean towards skipping the disk cache flush as the default
behaviour, however, if we implement a write barrier for the commit
record (see below).
WAPBL would need to deal with drives without FUA, i.e. fall back to a
cache flush.

3.3 async, or 'group sync', writes
Submit all the journal block writes to the drive at once, instead of
writing the blocks synchronously one by one. We could even make the
journal block writes fully async once we have the commit record
checksum.
Implementing the 'group sync' write would be quite simple; making it
fully async is more difficult and actually not very useful for
journalling, since the commit would force those writes to the drive
anyway if it is a write barrier (see below).
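
Roughly the difference I have in mind, with hypothetical journal_write()
and journal_biowait() helpers standing in for the real buffer submission
path:

#include <stddef.h>

struct jblock;                          /* journal block, details elided */
void journal_write(struct jblock *);    /* start an async write */
void journal_biowait(struct jblock *);  /* wait for that write to finish */
void write_commit_record(void);

/* Today (conceptually): one synchronous write per journal block,
 * i.e. a full device round trip for each of them. */
static void
journal_flush_one_by_one(struct jblock **blk, size_t nblocks)
{
        for (size_t i = 0; i < nblocks; i++) {
                journal_write(blk[i]);
                journal_biowait(blk[i]);
        }
        write_commit_record();
}

/* 'Group sync': submit everything first, then wait once.  The drive
 * sees all the writes together and can schedule them itself; the
 * commit record still goes out only after all of them have completed
 * (with the commit checksum from 3.1, even that wait could go away). */
static void
journal_flush_grouped(struct jblock **blk, size_t nblocks)
{
        for (size_t i = 0; i < nblocks; i++)
                journal_write(blk[i]);
        for (size_t i = 0; i < nblocks; i++)
                journal_biowait(blk[i]);
        write_commit_record();
}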

4. disk subsystem and journalling-related improvements
4.1 write barriers
The current DIOCCACHESYNC has the problem that it can quite easily be
I/O starved if the drive is heavily loaded. Normally the drive firmware
flushes the write cache very quickly (i.e. within milliseconds, once it
has a full track of data), but concurrent disk activity might prevent it
from doing so soon enough.
A more serious NetBSD kernel problem, however, is that DIOCCACHESYNC
bypasses bufq, so if there are any queued writes, DIOCCACHESYNC sends
the command to the disk before those writes are sent to the drive.
To avoid both problems, it would be good to have a way to mark a buf as
a barrier. bufq and/or the disk routines would be changed to drain the
write queue before the barrier write is sent to the drive, and any later
writes would wait until the barrier write completes. On sane hardware
like SCSI/SAS, this could be almost completely offloaded to the
controller by just using ORDERED tags, with no need to drain the queue.
This would be semi-hard to implement, especially if it requires changes
to disk drivers.
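
A sketch of the drain-based fallback; the B_BARRIER flag value and the
counters are invented for illustration, and a real driver couldn't sleep
in its strategy/interrupt path (it would park the buf instead of
cv_wait()ing), but the ordering rule is the same.  On SCSI/SAS the whole
thing collapses into putting an ORDERED tag on the barrier write:

#include <sys/types.h>
#include <sys/buf.h>
#include <sys/mutex.h>
#include <sys/condvar.h>

#define B_BARRIER       0x01000000      /* invented flag value, sketch only */

struct barrier_state {
        kmutex_t        bs_lock;
        kcondvar_t      bs_cv;
        int             bs_inflight;    /* writes handed to the drive */
        bool            bs_barrier;     /* a barrier write is outstanding */
};

/* Before handing a buf from the bufq to the drive. */
static void
barrier_start(struct barrier_state *bs, struct buf *bp)
{
        mutex_enter(&bs->bs_lock);
        if (bp->b_flags & B_BARRIER) {
                /* drain: everything already issued must complete first */
                while (bs->bs_inflight > 0)
                        cv_wait(&bs->bs_cv, &bs->bs_lock);
                bs->bs_barrier = true;
        } else {
                /* ordinary writes stall while a barrier is outstanding */
                while (bs->bs_barrier)
                        cv_wait(&bs->bs_cv, &bs->bs_lock);
        }
        bs->bs_inflight++;
        mutex_exit(&bs->bs_lock);
}

/* From the completion (biodone) path. */
static void
barrier_done(struct barrier_state *bs, struct buf *bp)
{
        mutex_enter(&bs->bs_lock);
        bs->bs_inflight--;
        if (bp->b_flags & B_BARRIER)
                bs->bs_barrier = false;
        cv_broadcast(&bs->bs_cv);
        mutex_exit(&bs->bs_lock);
}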

4.2 scsipi: default to SIMPLE tags instead of ORDERED
From a quick scsipi_base.c inspection, it seems we use an ORDERED tag if
none was specified in the request. This seems like a waste. It probably
assumes disksort() does a miracle job, but bufq disksort can't account
for e.g. head position, so this is actually a misoptimization even for
spinning rust, and not useful at all for SSDs. We should change the
default to SIMPLE and rely on the disk firmware to do its job.
This is very simple to do.
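
The shape of the change, not actual scsipi_base.c code (the flag and
message names are spelled out locally here, so this is only a sketch of
the intended default, to be mapped onto the real XS_CTL_*/MSG_* values):

#include <stdint.h>
#include <stdbool.h>

/* SCSI queue tag message codes, per SPC. */
#define MSG_SIMPLE_Q_TAG        0x20
#define MSG_HEAD_OF_Q_TAG       0x21
#define MSG_ORDERED_Q_TAG       0x22

/*
 * Pick the queue tag for a request.  want_ordered/want_head stand for
 * whatever per-request flags the caller set; the only point is that
 * the fallback becomes SIMPLE instead of ORDERED, leaving the ordering
 * decisions to the drive firmware.
 */
static uint8_t
choose_tag(bool want_ordered, bool want_head)
{
        if (want_ordered)
                return MSG_ORDERED_Q_TAG;
        if (want_head)
                return MSG_HEAD_OF_Q_TAG;
        return MSG_SIMPLE_Q_TAG;        /* previously the default was ORDERED */
}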

4.3 generic FUA flag support
To avoid a full cache sync after the journal commit, it would be useful
to mark certain writes (like the journal commit record write) to bypass
the disk write cache. There is a FUA bit in the SCSI/SAS command set, in
NCQ commands for SATA, and in NVMe. This could be as simple as a struct
buf flag which the disk driver acts upon. Very simple to do for SCSI and
NVMe since the support for tags is already there; for SATA we'd need to
implement NCQ support first :D
This is quite easy to do; it's just a struct buf flag and tweaks to the
scsipi/nvme code.
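
For the SCSI side the driver change could be about this small; the
B_MEDIA_FUA name and value are invented here, but byte 1 bit 3 of
WRITE(10)/WRITE(16) is the standard FUA bit:

#include <stdint.h>

#define B_MEDIA_FUA     0x02000000  /* invented struct buf flag, sketch only */
#define WRITE_FUA_BIT   0x08        /* FUA: byte 1, bit 3 of WRITE(10)/(16) */

/*
 * In the sd(4) strategy path, after the WRITE CDB has been built:
 * if the buf asks for Force Unit Access, set the FUA bit so this one
 * write bypasses the drive's volatile write cache.  NVMe would set the
 * FUA bit (bit 30) of CDW12 in the write command; SATA expresses it in
 * the NCQ WRITE FPDMA QUEUED command, which is why 4.4 matters.
 */
static void
maybe_set_fua(uint8_t *cdb, uint32_t bflags)
{
        if (bflags & B_MEDIA_FUA)
                cdb[1] |= WRITE_FUA_BIT;
}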

4.4 NCQ support for AHCI
We need NCQ to support the FUA flag on SATA. We have AHCI support, but
without NCQ. FreeBSD supports AHCI NCQ, and OpenBSD seems to have some
kind of support too, so both can be used as references besides the
official AHCI specification. Most recent motherboards support AHCI mode.
Some non-AHCI PCI SATA controllers support NCQ too, but those would be
out of scope for now.
This is semi-hard to do.

I plan to start on group 1, followed by 3.1 checksum, 4.1 write
barrier, 4.3 generic FUA support, and finally 3.2 FUA usage.

Comments are welcome :)

Jaromir

[*] WAPBL is of course not so generic, since it forces a particular
on-disk format right now, but it could eventually be made more flexible.

Paul Goyette

Sep 21, 2016, 7:57:19 PM
to Jaromír Doleček, tech...@netbsd.org
I think 2.2 (the cg stuff) would also be important to include before
re-enabling by default.



Also consider:

While not particularly part of wapbl itself, I would like to see its
callers (ie, lfs) be more modular!

Currently, ffs (whether built-in or modular) has to be built with
OPTIONS WAPBL enabled in order to use wapbl. And the ffs module has to
"require" the wapbl module.

It would be desirable (at least for me) if ffs (and any future users of
wapbl) could auto-load the wapbl module whenever it is needed, i.e. when
an existing log-enabled filesystem is mounted (or a new log needs to be
created), and possibly also when an existing log needs to be removed
after a 'tunefs -l 0'.
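
Something along these lines is what I imagine (I haven't checked how
well module_autoload() fits into the mount path, so this is a sketch of
the intent, not a patch):

#include <sys/module.h>

/*
 * In the ffs mount path, once we know the superblock has a log (or the
 * user asked for one): pull in the wapbl module on demand instead of
 * requiring ffs to be built with "options WAPBL" and the ffs module to
 * statically "require" the wapbl module.
 */
static int
ffs_wapbl_autoload(void)
{
        /* Whether MODULE_CLASS_VFS is the right class is part of the sketch. */
        return module_autoload("wapbl", MODULE_CLASS_VFS);
}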

This is probably beyond what you expected to do, but I just thought I'd
"throw it out there" to get it onto everyone's radar. :)

+------------------+--------------------------+------------------------+
| Paul Goyette | PGP Key fingerprint: | E-mail addresses: |
| (Retired) | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org |
+------------------+--------------------------+------------------------+

Brian Buhrow

Sep 21, 2016, 8:06:30 PM
to Paul Goyette, Jaromír Doleček, tech...@netbsd.org, buh...@nfbcal.org
hello. Does this discussion imply that the WAPBL log/journaling
function is broken in NetBSD-current? Are we back to straight FFS as it
was before the days of WAPBL or softdep? Please tell me I'm mistaken about
this. If so, that's quite a regression, even from NetBSD-5 where both
WAPBL log and softdep work quite well.
-thanks
-Brian

Taylor R Campbell

Sep 22, 2016, 4:57:51 AM
to Brian Buhrow, Paul Goyette, Jaromír Doleček, tech...@netbsd.org, buh...@nfbcal.org
Date: Wed, 21 Sep 2016 17:06:18 -0700
From: Brian Buhrow <buh...@nfbcal.org>
It is no more broken than it was in netbsd-5.

Thor Lancelot Simon

Sep 22, 2016, 7:50:38 AM
to Jaromír Doleček, tech...@netbsd.org
On Thu, Sep 22, 2016 at 01:27:48AM +0200, Jaromír Doleček wrote:
>
> 3.2 use FUA (Force Unit Access) for commit record write
> This avoids need to issue even the second DIOCCACHESYNC, as flushing
> the disk cache is not really all that useful, I like the thread over
> at:
> http://yarchive.net/comp/linux/drive_caches.html
> Slightly less controversially, this would allow the rest of the
> journal records to be written asynchronously, leaving them to execute
> even after commit if so desired. It may be useful to have this
> behaviour optional. I lean towards skipping the disk cache flush as
> default behaviour however, if we implement write barrier for the
> commit record (see below).
> WAPBL would need to deal with drives without FUA, i.e fall back to cache flush.

I have never understood this business about needing FUA to implement
barriers. AFAICT, for any SCSI or SCSI-like disk device, all that is
actually needed is to do standard writes with simple tags, and barrier
writes with ordered tags. What am I missing?

I must have proposed adding a B_BARRIER or B_ORDERED at least five times
over the years. There are always objections...

Thor

Manuel Bouyer

Sep 22, 2016, 10:07:09 AM
to Thor Lancelot Simon, Jaromír Doleček, tech...@netbsd.org
On Thu, Sep 22, 2016 at 07:50:27AM -0400, Thor Lancelot Simon wrote:
> On Thu, Sep 22, 2016 at 01:27:48AM +0200, Jaromír Doleček wrote:
> >
> > 3.2 use FUA (Force Unit Access) for commit record write
> > This avoids need to issue even the second DIOCCACHESYNC, as flushing
> > the disk cache is not really all that useful, I like the thread over
> > at:
> > http://yarchive.net/comp/linux/drive_caches.html
> > Slightly less controversially, this would allow the rest of the
> > journal records to be written asynchronously, leaving them to execute
> > even after commit if so desired. It may be useful to have this
> > behaviour optional. I lean towards skipping the disk cache flush as
> > default behaviour however, if we implement write barrier for the
> > commit record (see below).
> > WAPBL would need to deal with drives without FUA, i.e fall back to cache flush.
>
> I have never understood this business about needing FUA to implement
> barriers. AFAICT, for any SCSI or SCSI-like disk device, all that is
> actually needed is to do standard writes with simple tags, and barrier
> writes with ordered tags. What am I missing?

AFAIK ordered tags only guarantee that the writes will happen in order,
but not that the writes are actually done to stable storage.
If you get a fsync() from userland, you have to do a cache flush (or FUA).

--
Manuel Bouyer <bou...@antioche.eu.org>
NetBSD: 26 years of experience will always make the difference
--

Thor Lancelot Simon

Sep 22, 2016, 9:33:33 PM
to Manuel Bouyer, Jaromír Doleček, tech...@netbsd.org
On Thu, Sep 22, 2016 at 04:06:55PM +0200, Manuel Bouyer wrote:
> On Thu, Sep 22, 2016 at 07:50:27AM -0400, Thor Lancelot Simon wrote:
> > On Thu, Sep 22, 2016 at 01:27:48AM +0200, Jaromír Doleček wrote:
> > >
> > > 3.2 use FUA (Force Unit Access) for commit record write
> > > This avoids need to issue even the second DIOCCACHESYNC, as flushing
> > > the disk cache is not really all that useful, I like the thread over
> > > at:
> > > http://yarchive.net/comp/linux/drive_caches.html
> > > Slightly less controversially, this would allow the rest of the
> > > journal records to be written asynchronously, leaving them to execute
> > > even after commit if so desired. It may be useful to have this
> > > behaviour optional. I lean towards skipping the disk cache flush as
> > > default behaviour however, if we implement write barrier for the
> > > commit record (see below).
> > > WAPBL would need to deal with drives without FUA, i.e fall back to cache flush.
> >
> > I have never understood this business about needing FUA to implement
> > barriers. AFAICT, for any SCSI or SCSI-like disk device, all that is
> > actually needed is to do standard writes with simple tags, and barrier
> > writes with ordered tags. What am I missing?
>
> AFAIK ordered tags only guarantees that the write will happen in order,
> but not that the writes are actually done to stable storage.

The target's not allowed to report the command complete unless the data
are on stable storage, except if you have write cache enable set in the
relevant mode page.

If you run SCSI drives like that, you're playing with fire. Expect to get
burned. The whole point of tagged queueing is to let you *not* set that
bit in the mode pages and still get good performance.

Thor

David Holland

Sep 23, 2016, 12:24:19 AM
to Paul Goyette, Jaromír Doleček, tech...@netbsd.org
On Thu, Sep 22, 2016 at 07:57:00AM +0800, Paul Goyette wrote:
> While not particularly part of wapbl itself, I would like to see its
> callers (ie, lfs) be more modular!

lfs is not related to wapbl, or even (now) ufs.

> Currently, ffs (whether built-in or modular) has to be built with OPTIONS
> WAPBL enabled in order to use wapbl. And the ffs module has to "require"
> the wapbl module.

This is because there is allegedly-filesystem-independent wapbl code
that was thought to maybe be reusable for additional block-journaling
implementations, e.g. ext3. I have always had doubts about this and it
hasn't panned out so far.

--
David A. Holland
dhol...@netbsd.org

Manuel Bouyer

Sep 23, 2016, 5:50:38 AM
to Thor Lancelot Simon, Jaromír Doleček, tech...@netbsd.org
On Thu, Sep 22, 2016 at 09:33:18PM -0400, Thor Lancelot Simon wrote:
> > AFAIK ordered tags only guarantees that the write will happen in order,
> > but not that the writes are actually done to stable storage.
>
> The target's not allowed to report the command complete unless the data
> are on stable storage, except if you have write cache enable set in the
> relevant mode page.
>
> If you run SCSI drives like that, you're playing with fire. Expect to get
> burned. The whole point of tagged queueing is to let you *not* set that
> bit in the mode pages and still get good performance.

Now I remember that I did indeed disable disk write cache when I had
scsi disks in production. It's been a while though.

But anyway, from what I remember you still need the disk cache flush
operation for SATA, even with NCQ. It's not equivalent to the SCSI tags.

Edgar Fuß

Sep 23, 2016, 6:21:26 AM
to tech...@netbsd.org
> The whole point of tagged queueing is to let you *not* set [the write
> cache] bit in the mode pages and still get good performance.
I don't get that. My understanding was that TCQ allowed the drive to re-order
commands within the bounds described by the tags. With the write cache
disabled, all write commands must hit stable storage before being reported
completed. So what's the point of tagging with caching disabled?

David Holland

Sep 23, 2016, 7:05:22 AM
to tech...@netbsd.org
You can have more than one in flight at a time. Typically the more you
can manage to have pending at once, the better the performance,
especially with SSDs.

Johnny Billquist

Sep 23, 2016, 7:12:44 AM
to tech...@netbsd.org
Totally independent of any caching - disk I/O performance can be greatly
improved by reordering operations to minimize disk head movement. Most
of disk I/O time is spent on head movement; I'd guess it makes up about
90% of the total.

Johnny

--
Johnny Billquist || "I'm on a bus
|| on a psychedelic trip
email: b...@softjar.se || Reading murder books
pdp is alive! || tryin' to stay hip" - B. Idol

Edgar Fuß

Sep 23, 2016, 7:16:26 AM
to tech...@netbsd.org
> You can have more than one in flight at a time.
My SCSI knowledge is probably outdated. How can I have several commands
in flight concurrently?

Manuel Bouyer

Sep 23, 2016, 8:15:30 AM
to tech...@netbsd.org
This is what tagged queueing is for.

Johnny Billquist

Sep 23, 2016, 8:36:29 AM
to David Holland, tech...@netbsd.org
I'd say especially with rotating rust, but either way... :-)
Yes, that's the whole point of tagged queuing. Issue many operations.
Let the disk and controller sort out in which order to do them to make
it the most efficient.

With rotating rust, the order of operations can make a huge difference
in speed. With SSDs you don't have those seek times to begin with, so I
would expect the gains to be marginal.

Greg Troxel

Sep 23, 2016, 9:55:31 AM
to Johnny Billquist, David Holland, tech...@netbsd.org

Johnny Billquist <b...@softjar.se> writes:

> With rotating rust, the order of operations can make a huge difference
> in speed. With SSDs you don't have those seek times to begin with, so
> I would expect the gains to be marginal.

For reordering, I agree with you, but SSD speeds are so high that
pipelining is probably necessary to keep the SSD from stalling due to
not having enough data to write. So this could help move from the
300 MB/s that I am seeing to 550 MB/s.

Johnny Billquist

Sep 23, 2016, 10:00:20 AM
to Greg Troxel, David Holland, tech...@netbsd.org
Good point. In which case (if I read you right), it's not the reordering
that matters, but the simple case of being able to queue up several
operations, to keep the disk busy. And potentially running several disks
in parallel. Keeping them all busy. And we of course also have the
pre-processing work before the command is queued, which can be done
while the controller is busy. There are many potential gains here.

Thor Lancelot Simon

Sep 23, 2016, 10:06:02 AM
to Manuel Bouyer, Jaromír Doleček, tech...@netbsd.org
On Fri, Sep 23, 2016 at 11:47:24AM +0200, Manuel Bouyer wrote:
> On Thu, Sep 22, 2016 at 09:33:18PM -0400, Thor Lancelot Simon wrote:
> > > AFAIK ordered tags only guarantees that the write will happen in order,
> > > but not that the writes are actually done to stable storage.
> >
> > The target's not allowed to report the command complete unless the data
> > are on stable storage, except if you have write cache enable set in the
> > relevant mode page.
> >
> > If you run SCSI drives like that, you're playing with fire. Expect to get
> > burned. The whole point of tagged queueing is to let you *not* set that
> > bit in the mode pages and still get good performance.
>
> Now I remember that I did indeed disable disk write cache when I had
> scsi disks in production. It's been a while though.
>
> But anyway, from what I remember you still need the disk cache flush
> operation for SATA, even with NCQ. It's not equivalent to the SCSI tags.

I think that's true only if you're running with write cache enabled; but
the difference is that most ATA disks ship with it turned on by default.

With an aggressive implementation of tag management on the host side,
there should be no performance benefit from unconditionally enabling
the write cache -- all the available cache should be used to stage
writes for pending tags. Sometimes it works.

--
Thor Lancelot Simon t...@panix.com

"The dirtiest word in art is the C-word. I can't even say 'craft'
without feeling dirty." -Chuck Close

Thor Lancelot Simon

Sep 23, 2016, 10:10:15 AM
to Greg Troxel, Johnny Billquist, David Holland, tech...@netbsd.org
The iSCSI case is illustrative, too. Now you can have a "SCSI bus" with
a huge bandwidth delay product. It doesn't matter how quickly the target
says it finished one command (which is all enabling the write-cache can get
you) if you are working in lockstep such that the initiator cannot send
more commands until it receives the target's ack.

This is why on iSCSI you really do see hundreds of tags in flight at
once. You can pump up the request size, but that causes fairness
problems. Keeping many commands active at the same time helps much more.

Now think about that SSD again. The SSD's write latency is so low that
_relative to the delay time it takes the host to issue a new command_ you
have the same problem. It's clear that enabling the write cache can't
really help, or at least can't help much: you need to have many commands
pending at the same time.

Our storage stack's inability to use tags with SATA targets is a huge
gating factor for performance with real workloads (the residual use of
the kernel lock at and below the bufq layer is another). Starting de
novo with NVMe, where it's perverse and structurally difficult to not
support multiple commands in flight simultaneously, will help some, but
SATA SSDs are going to be around for a long time still and it'd be
great if this limitation went away.

That said, I am not going to fix it myself so all I can do is sit here
and pontificate -- which is worth about what you paid for it, and no
more.

Thor

Warner Losh

Sep 23, 2016, 10:51:44 AM
to Thor Lancelot Simon, Manuel Bouyer, Jaromír Doleček, Tech-kern
On Fri, Sep 23, 2016 at 7:38 AM, Thor Lancelot Simon <t...@panix.com> wrote:
> On Fri, Sep 23, 2016 at 11:47:24AM +0200, Manuel Bouyer wrote:
>> On Thu, Sep 22, 2016 at 09:33:18PM -0400, Thor Lancelot Simon wrote:
>> > > AFAIK ordered tags only guarantees that the write will happen in order,
>> > > but not that the writes are actually done to stable storage.
>> >
>> > The target's not allowed to report the command complete unless the data
>> > are on stable storage, except if you have write cache enable set in the
>> > relevant mode page.
>> >
>> > If you run SCSI drives like that, you're playing with fire. Expect to get
>> > burned. The whole point of tagged queueing is to let you *not* set that
>> > bit in the mode pages and still get good performance.
>>
>> Now I remember that I did indeed disable disk write cache when I had
>> scsi disks in production. It's been a while though.
>>
>> But anyway, from what I remember you still need the disk cache flush
>> operation for SATA, even with NCQ. It's not equivalent to the SCSI tags.

All NCQ gives you is the ability to schedule multiple requests and
to get notification of their completion (perhaps out of order). There are
no coherency features at all in NCQ.

> I think that's true only if you're running with write cache enabled; but
> the difference is that most ATA disks ship with it turned on by default.
>
> With an aggressive implementation of tag management on the host side,
> there should be no performance benefit from unconditionally enabling
> the write cache -- all the available cache should be used to stage
> writes for pending tags. Sometimes it works.

You don't need to flush all the writes, but do need to take special care
if you need more coherent semantics, which often is a small minority
of the writes, so I would agree the effect can be mostly mitigated. Not
completely since any coherency point has to drain the queue completely.
The cache drain ops are non-NCQ, and to send non-NCQ requests
no NCQ requests can be pending. TRIM[*] commands are the same way.

Warner

[*] There is an NCQ version of TRIM, but it requires the AUX register
to be sent and very few sata hosts controllers support that (though
AHCI does, many of the LSI controllers don't in any performant way).

Warner Losh

Sep 23, 2016, 10:59:49 AM
to Thor Lancelot Simon, Greg Troxel, Johnny Billquist, David Holland, Tech-kern
On Fri, Sep 23, 2016 at 8:05 AM, Thor Lancelot Simon <t...@panix.com> wrote:
> Our storage stack's inability to use tags with SATA targets is a huge
> gating factor for performance with real workloads (the residual use of
> the kernel lock at and below the bufq layer is another).

FreeBSD's storage stack does support NCQ. When that's artificially
turned off, performance drops on a certain brand of SSDs from about
500-550MB/s for large reads down to 200-300MB/s depending on
too many factors to go into here. It helps a lot for real workloads and
is critical for Netflix to get a 36-38Gbps rate from our 40Gbps systems.

> Starting de
> novo with NVMe, where it's perverse and structurally difficult to not
> support multiple commands in flight simultaneously, will help some, but
> SATA SSDs are going to be around for a long time still and it'd be
> great if this limitation went away.

NVMe is even worse. There's one drive that w/o queueing I can barely
get 1GB/s out of. With queueing and multiple requests I can get the
spec sheet rated 3.6GB/s. Here queueing is critical for Netflix to get to
90-93Gbps that our 100Gbps boxes can do (though it is but one of
many things).

> That said, I am not going to fix it myself so all I can do is sit here
> and pontificate -- which is worth about what you paid for it, and no
> more.

Yea, I'm just a FreeBSD guy lurking here.

Warner

Manuel Bouyer

Sep 23, 2016, 11:21:35 AM
to Thor Lancelot Simon, Jaromír Doleček, tech...@netbsd.org
On Fri, Sep 23, 2016 at 09:38:44AM -0400, Thor Lancelot Simon wrote:
> > But anyway, from what I remember you still need the disk cache flush
> > operation for SATA, even with NCQ. It's not equivalent to the SCSI tags.
>
> I think that's true only if you're running with write cache enabled; but
> the difference is that most ATA disks ship with it turned on by default.

all of them have it turned on by default, and you can't permanently
disable it (you have to turn it off after each reset)

>
> With an aggressive implementation of tag management on the host side,
> there should be no performance benefit from unconditionally enabling
> the write cache -- all the available cache should be used to stage
> writes for pending tags. Sometimes it works.

With ATA you have only 32 tags ...

Eric Haszlakiewicz

Sep 23, 2016, 1:19:10 PM
to Warner Losh, Thor Lancelot Simon, Manuel Bouyer, Jaromír Doleček, Tech-kern
On September 23, 2016 10:51:30 AM EDT, Warner Losh <i...@bsdimp.com> wrote:
>All NCQ gives you is the ability to schedule multiple requests and
>to get notification of their completion (perhaps out of order). There's
>no coherency features are all in NCQ.

This seems like the key thing needed to avoid FUA: to implement fsync(), you just wait for the completion notifications; once you have them for all requests that were pending when fsync() was called (or that were started as part of the fsync), you're done.
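
A sketch of that bookkeeping (the structure is invented; in real life it
would hang off the vnode or the disk queue, and of course a completion
only implies stable storage if the drive's write cache is disabled):

#include <sys/types.h>
#include <sys/queue.h>
#include <sys/mutex.h>
#include <sys/condvar.h>

/* One record per write in flight. */
struct wrec {
        TAILQ_ENTRY(wrec) w_entry;
        uint64_t        w_seq;
};

struct wtrack {                         /* init of members omitted */
        kmutex_t        wt_lock;
        kcondvar_t      wt_cv;
        uint64_t        wt_nextseq;
        TAILQ_HEAD(, wrec) wt_pending;  /* kept in issue order */
};

/* Submission path: number the write and remember it. */
static void
wtrack_issue(struct wtrack *wt, struct wrec *w)
{
        mutex_enter(&wt->wt_lock);
        w->w_seq = wt->wt_nextseq++;
        TAILQ_INSERT_TAIL(&wt->wt_pending, w, w_entry);
        mutex_exit(&wt->wt_lock);
}

/* Completion path: completions may arrive in any order (NCQ). */
static void
wtrack_done(struct wtrack *wt, struct wrec *w)
{
        mutex_enter(&wt->wt_lock);
        TAILQ_REMOVE(&wt->wt_pending, w, w_entry);
        cv_broadcast(&wt->wt_cv);
        mutex_exit(&wt->wt_lock);
}

/*
 * fsync(): wait until no write issued before this call is still
 * pending.  The list head is always the oldest outstanding write, so
 * we are done as soon as its seqno is newer than our snapshot.
 */
static void
wtrack_fsync(struct wtrack *wt)
{
        struct wrec *w;
        uint64_t target;

        mutex_enter(&wt->wt_lock);
        target = wt->wt_nextseq;
        while ((w = TAILQ_FIRST(&wt->wt_pending)) != NULL && w->w_seq < target)
                cv_wait(&wt->wt_cv, &wt->wt_lock);
        mutex_exit(&wt->wt_lock);
}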

Eric

Thor Lancelot Simon

Sep 23, 2016, 1:39:49 PM
to Eric Haszlakiewicz, Warner Losh, Manuel Bouyer, Jaromír Doleček, Tech-kern
The other key point is that -- unless SATA NCQ is radically different from
SCSI tagged queuing in a particularly stupid way -- the rules require all
"simple" tags to be completed before any "ordered" tag is completed. That is,
ordered tags are barriers against all simple tags.

So, with the write cache disabled, you can use a single command with an
ordered tag to force all preceding commands to complete, but continue
issuing commands while that happens.
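
In WAPBL terms that would look roughly like this; the helper and tag
names are invented, the point is only which write gets which tag and
that no cache flush is issued at all:

#include <stddef.h>

struct jblock;                          /* journal block, details elided */
enum tagtype { TAG_SIMPLE, TAG_ORDERED };
void disk_write_tagged(struct jblock *, enum tagtype);  /* hypothetical */

static void
journal_commit_with_ordered_tag(struct jblock **blk, size_t nblocks,
    struct jblock *commit_rec)
{
        /* Journal data: SIMPLE tags, the drive may reorder these freely. */
        for (size_t i = 0; i < nblocks; i++)
                disk_write_tagged(blk[i], TAG_SIMPLE);

        /*
         * Commit record: ORDERED tag.  The target must finish every
         * earlier simple-tagged write before this one, and this one
         * before anything queued after it -- a barrier with no
         * DIOCCACHESYNC and no host-side queue drain.
         */
        disk_write_tagged(commit_rec, TAG_ORDERED);
}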

To me, this is considerably more elegant than assuming all commands will
"complete" only to the cache by default and then setting FUA for commands
where you can't tolerate that misbehavior -- and certainly better than
flushing the whole cache, which is roughly like blowing off your own head
because you have a pimple on your nose. But, clearly, others disagree.

Thor

Manuel Bouyer

Sep 23, 2016, 1:45:17 PM
to Eric Haszlakiewicz, Warner Losh, Thor Lancelot Simon, Jaromír Doleček, Tech-kern
On Fri, Sep 23, 2016 at 05:15:16PM +0000, Eric Haszlakiewicz wrote:
*if you have the write cache disabled*

Manuel Bouyer

Sep 23, 2016, 1:48:07 PM
to Thor Lancelot Simon, Eric Haszlakiewicz, Warner Losh, Jaromír Doleček, Tech-kern
On Fri, Sep 23, 2016 at 01:20:09PM -0400, Thor Lancelot Simon wrote:
> On Fri, Sep 23, 2016 at 05:15:16PM +0000, Eric Haszlakiewicz wrote:
> > On September 23, 2016 10:51:30 AM EDT, Warner Losh <i...@bsdimp.com> wrote:
> > >All NCQ gives you is the ability to schedule multiple requests and
> > >to get notification of their completion (perhaps out of order). There's
> > >no coherency features are all in NCQ.
> >
> > This seems like the key thing needed to avoid FUA: to implement fsync() you just wait for notifications of completion to be received, and once you have those for all requests pending when fsync was called, or started as part of the fsync, then you're done.
>
> The other key point is that -- unless SATA NCQ is radically different from
> SCSI tagged queuing in a particularly stupid way -- the rules require all
> "simple" tags to be completed before any "ordered" tag is completed. That is,
> ordered tags are barriers against all simple tags.

If I remember properly, there are only simple tags in ATA.

Thor Lancelot Simon

Sep 23, 2016, 1:49:37 PM
to Manuel Bouyer, Eric Haszlakiewicz, Warner Losh, Jaromír Doleček, Tech-kern
On Fri, Sep 23, 2016 at 07:45:00PM +0200, Manuel Bouyer wrote:
> On Fri, Sep 23, 2016 at 05:15:16PM +0000, Eric Haszlakiewicz wrote:
> > On September 23, 2016 10:51:30 AM EDT, Warner Losh <i...@bsdimp.com> wrote:
> > >All NCQ gives you is the ability to schedule multiple requests and
> > >to get notification of their completion (perhaps out of order). There's
> > >no coherency features are all in NCQ.
> >
> > This seems like the key thing needed to avoid FUA: to implement fsync() you just wait for notifications of completion to be received, and once you have those for all requests pending when fsync was called, or started as part of the fsync, then you're done.
>
> *if you have the write cache disabled*

*Running with the write cache enabled is a bad idea*

Manuel Bouyer

Sep 23, 2016, 1:51:43 PM
to Thor Lancelot Simon, Eric Haszlakiewicz, Warner Losh, Jaromír Doleček, Tech-kern
On Fri, Sep 23, 2016 at 01:46:09PM -0400, Thor Lancelot Simon wrote:
> > > This seems like the key thing needed to avoid FUA: to implement fsync() you just wait for notifications of completion to be received, and once you have those for all requests pending when fsync was called, or started as part of the fsync, then you're done.
> >
> > *if you have the write cache disabled*
>
> *Running with the write cache enabled is a bad idea*

On ATA devices, you can't permanently disable the write cache. You have
to do it on every power cycle.

Well, this really needs to be carefully evaluated. With only 32 tags I'm
not sure you can efficiently use recent devices with the write cache
disabled (most enterprise disks have a 64MB cache these days).

Warner Losh

Sep 23, 2016, 1:54:28 PM
to Thor Lancelot Simon, Eric Haszlakiewicz, Manuel Bouyer, Jaromír Doleček, Tech-kern
On Fri, Sep 23, 2016 at 11:20 AM, Thor Lancelot Simon <t...@panix.com> wrote:
> On Fri, Sep 23, 2016 at 05:15:16PM +0000, Eric Haszlakiewicz wrote:
>> On September 23, 2016 10:51:30 AM EDT, Warner Losh <i...@bsdimp.com> wrote:
>> >All NCQ gives you is the ability to schedule multiple requests and
>> >to get notification of their completion (perhaps out of order). There's
>> >no coherency features are all in NCQ.
>>
>> This seems like the key thing needed to avoid FUA: to implement fsync() you just wait for notifications of completion to be received, and once you have those for all requests pending when fsync was called, or started as part of the fsync, then you're done.
>
> The other key point is that -- unless SATA NCQ is radically different from
> SCSI tagged queuing in a particularly stupid way -- the rules require all
> "simple" tags to be completed before any "ordered" tag is completed. That is,
> ordered tags are barriers against all simple tags.

SATA NCQ doesn't have ordered tags. There are just 32 slots to send
requests into. Don't allow the word 'tag' to confuse you into thinking
it is anything at all like SCSI tags. You get ordering by not
scheduling anything until after the queue has drained when you send
your "ordered" command. It is that stupid.

Warner

Warner Losh

Sep 23, 2016, 1:58:31 PM
to Thor Lancelot Simon, Eric Haszlakiewicz, Manuel Bouyer, Jaromír Doleček, Tech-kern
And it can be even worse: if the 'ordered' item must complete after
everything before it, you have to drain the queue before you can even
send it to the drive. It depends on what ordering guarantees you
want...

Warner

Paul....@dell.com

Sep 24, 2016, 1:23:15 AM
to e...@math.uni-bonn.de, tech...@netbsd.org
I'm not sure. But I have the impression that in the real world tagging is rarely, if ever, used.

paul

Paul....@dell.com

Sep 24, 2016, 1:23:34 AM
to i...@bsdimp.com, t...@panix.com, bou...@antioche.eu.org, jaromir...@gmail.com, tech...@netbsd.org

David Holland

Sep 24, 2016, 4:01:37 AM
to Manuel Bouyer, Thor Lancelot Simon, Eric Haszlakiewicz, Warner Losh, Jaromír Doleček, Tech-kern
On Fri, Sep 23, 2016 at 07:51:32PM +0200, Manuel Bouyer wrote:
> > > *if you have the write cache disabled*
> >
> > *Running with the write cache enabled is a bad idea*
>
> On ATA devices, you can't permanently disable the write cache. You have
> to do it on every power cycles.

There are also drives that ignore attempts to turn off write caching.

Thor Lancelot Simon

Sep 24, 2016, 10:38:39 AM
to Paul....@dell.com, e...@math.uni-bonn.de, tech...@netbsd.org
On Fri, Sep 23, 2016 at 01:02:26PM +0000, Paul....@dell.com wrote:
I'm not sure what you mean. Do you mean that tagging is rarely, if ever,
used _to establish write barriers_, or do you mean that tagging is rarely,
if ever used, period?

If the latter, you're way, way wrong.

Thor

Warner Losh

Sep 24, 2016, 5:14:31 PM
to David Holland, Manuel Bouyer, Thor Lancelot Simon, Eric Haszlakiewicz, Jaromír Doleček, Tech-kern
On Sat, Sep 24, 2016 at 2:01 AM, David Holland <dholla...@netbsd.org> wrote:
> On Fri, Sep 23, 2016 at 07:51:32PM +0200, Manuel Bouyer wrote:
> > > > *if you have the write cache disabled*
> > >
> > > *Running with the write cache enabled is a bad idea*
> >
> > On ATA devices, you can't permanently disable the write cache. You have
> > to do it on every power cycles.
>
> There are also drives that ignore attempts to turn off write caching.

These drives lie to the host and say that caching is off, when it
really is still on, right?

Warner

Michael van Elst

Sep 26, 2016, 10:29:50 AM
to tech...@netbsd.org
i...@bsdimp.com (Warner Losh) writes:

>NVMe is even worse. There's one drive that w/o queueing I can barely
>get 1GB/s out of. With queueing and multiple requests I can get the
>spec sheet rated 3.6GB/s. Here queueing is critical for Netflix to get to
>90-93Gbps that our 100Gbps boxes can do (though it is but one of
>many things).

Luckily the Samsung 950pro isn't of that type. Can you tell what
NVMe devices (in particular in M.2 form factor) have that problem?

--
Michael van Elst
Internet: mle...@serpens.de
"A potential Snark may lurk in every tree."

Michael van Elst

Sep 26, 2016, 10:54:21 AM
to tech...@netbsd.org
b...@softjar.se (Johnny Billquist) writes:

>Good point. In which case (if I read you right), it's not the reordering
>that matters, but the simple case of being able to queue up several
>operations, to keep the disk busy.

For sequential reading we are currently limited to 8 operations in
flight (uvm readahead). This is less of an issue for local disks, but
it has a big impact on iSCSI. It also makes reading through the
filesystem faster than reading from the raw disk device.
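
To put rough numbers on that, assuming 64 kB read-ahead chunks (the real
chunk size depends on MAXPHYS and the readahead code): 8 x 64 kB = 512 kB
in flight. On a local disk with, say, a 0.5 ms round trip per request
that still allows on the order of 1 GB/s, but over an iSCSI path with a
5 ms round trip the same window caps sequential reads at roughly
512 kB / 5 ms = 100 MB/s, no matter how fast the link or the target is.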

Warner Losh

Sep 26, 2016, 5:20:23 PM
to Michael van Elst, Tech-kern
On Mon, Sep 26, 2016 at 8:27 AM, Michael van Elst <mle...@serpens.de> wrote:
> i...@bsdimp.com (Warner Losh) writes:
>
>>NVMe is even worse. There's one drive that w/o queueing I can barely
>>get 1GB/s out of. With queueing and multiple requests I can get the
>>spec sheet rated 3.6GB/s. Here queueing is critical for Netflix to get to
>>90-93Gbps that our 100Gbps boxes can do (though it is but one of
>>many things).
>
> Luckily the Samsung 950pro isn't of that type. Can you tell what
> NVMe devices (in particular in M.2 form factor) have that problem?

I've not used any m.2 devices. These tests were raw dd's of 128k I/Os
with one thread of execution, so no effective queueing at all. As
queueing gets involved, the performance increases dramatically as the
drive idle time drops substantially. I'd imagine most drives are like
this for the workload I was testing since you had to make a full
round-trip from the kernel to userland after the completion to get the
next I/O rather than having it already in the hardware... Unless
NetBSD's context switching is substantially faster than FreeBSD's, I'd
expect to see similar results there as well. Some cards do a little
better, but not by much... All cards do significantly better when
multiple transactions are scheduled simultaneously.

Just ran a couple of tests and found dd of 4k blocks gave me 160MB/s,
128k blocks gave me 600MB/s, 1M blocks gave me 636MB/s. random
read/write with 64 jobs and an I/O depth of 128 with 128k random reads
with fio gave me 3.5GB/s. This particular drive is rated at 3.6GB/s.
This is for a HGST Ultrastar SN100. All numbers from FreeBSD. In
production, for unencrypted traffic, we see a similar number to the
deep queue fio test. While I've not tried on NetBSD, I'd be surprised
if you got significantly more than these numbers, due to the round trip
to userland vs. having the next request already present in the drive...

Warner

Michael van Elst

Sep 27, 2016, 2:45:24 AM
to tech...@netbsd.org
i...@bsdimp.com (Warner Losh) writes:

>I've not used any m.2 devices. These tests were raw dd's of 128k I/Os
>with one thread of execution, so no effective queueing at all.

gossam: {4} dd if=/dev/rdk0 bs=128k of=/dev/null count=100000
100000+0 records in
100000+0 records out
13107200000 bytes transferred in 8.766 secs (1495231576 bytes/sec)

That's about 50% below the nominal speed due to syscall overhead
and no queuing. With bs=1024k the overhead is smaller, the device
is rated at 2.5GB/s for reading.

gossam: {7} dd if=/dev/rdk0 bs=1024k of=/dev/null count=10000 &
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 4.371 secs (2398938458 bytes/sec)


>Just ran a couple of tests and found dd of 4k blocks gave me 160MB/s,
>128k blocks gave me 600MB/s, 1M blocks gave me 636MB/s. random
>read/write with 64 jobs and an I/O depth of 128 with 128k random reeds
>with fio gave me 3.5GB/s. This particular drive is rated at 3.6GB/s.

Yes, those are similar results. With multiple dd's the numbers almost
add up until the CPUs become the bottleneck.

However, I was looking for devices that even fail the dd test with
large buffers. Apparently there are devices where you must use
concurrent I/O operations to reach their nominal speed, otherwise
you only get a fraction (maybe 20-30%).

Jaromír Doleček

Sep 28, 2016, 7:22:40 AM
to Manuel Bouyer, Thor Lancelot Simon, Eric Haszlakiewicz, Warner Losh, Tech-kern
I think it's a fair assessment to say that on SATA with NCQ/31 tags (the
max is actually 31, not 32 tags), it's pretty much impossible to get
acceptable write performance without using the write cache. We could
never saturate even a drive with a 16MB cache with just 31 tags and a
64k maxphys. So IMO it's not useful to design for a world without a disk
drive write cache.
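
To spell out the arithmetic: 31 tags x 64 kB MAXPHYS = 1984 kB, so less
than 2 MB of writes can ever be outstanding at once - roughly an eighth
of what a 16 MB write cache can buffer and reorder internally.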

Back to the discussion about B_ORDERED:

As was said before, a SCSI ORDERED tag does precisely what we want for
the journal commit record - it forces all previous commands sent to the
controller to finish before the ORDERED-tagged one is processed, and any
commands sent after the ORDERED-tagged one are executed only after it
has finished. No need for any bufq magic there, which is wonderful. Too
bad that NCQ doesn't provide this.

That said, we still need to be sure that all the previous commands
were sent prior to pushing the ORDERED command to the SCSI controller. Are
there any SCSI controllers with multiple submission queues (like
NVMe), regardless of our scsipi layer MP limitations?

FWIW, AHCI is single-threaded by design; every command submission has
to write to the same set of registers.

Jaromir

Paul....@dell.com

Sep 28, 2016, 4:04:04 PM
to jaromir...@gmail.com, bou...@antioche.eu.org, t...@panix.com, e...@nimenees.com, i...@bsdimp.com, tech...@netbsd.org

> On Sep 28, 2016, at 7:22 AM, Jaromír Doleček <jaromir...@gmail.com> wrote:
>
> I think it's far assesment to say that on SATA with NCQ/31 tags (max
> is actually 31, not 32 tags), it's pretty much impossible to have
> acceptable write performance without using write cache. We could never
> saturate even drive with 16MB cache with just 31 tags and 64k maxphys.
> So it's IMO not useful to design for world without disk drive write
> cache.

I think that depends on the software. In a SAN storage array I work on, we used to use SATA drives, always with cache disabled to avoid data loss due to power failure. We had them running just fine with NCQ. (For that matter, even without NCQ, though that takes major effort.)

So perhaps an optimization effort is called for, if people view this performance issue as worth the trouble. Or you might decide that for performance SAS is the answer, and SATA is only for non-critical applications.

paul
