sync vs async vs zfs

Quartz

Sep 24, 2015, 12:40:39 PM
to FreeBSD questions
I'm trying to spec out a new system that looks like it might be very
sensitive to sync vs async writes. However, after some research and
investigation I've come to realize that I don't think I understand
a/sync as well as I thought I did and might be confused about some of
the fundamentals.

Can someone point me to a good "newbie's guide" that explains sync vs
async from the ground up? One that makes no assumptions about prior
knowledge of filesystems and I/O. And likewise, another guide
specifically for how they relate to ZFS pool/vdev configuration?

Thanks in advance.

Paul Kraus

Sep 24, 2015, 3:06:12 PM
to Quartz, FreeBSD questions
On Sep 24, 2015, at 12:40, Quartz <qua...@sneakertech.com> wrote:

> I'm trying to spec out a new system that looks like it might be very sensitive to sync vs async writes. However, after some research and investigation I've come to realize that I don't think I understand a/sync as well as I thought I did and might be confused about some of the fundamentals.

Very short answer…

Both terms refer to writes only; there is no such thing as a sync or async read.

In the case of an async write, the application code (App) asks the Filesystem (FS) to write some data. The FS is free to do whatever it wants with the data and respond immediately that it has the data and that it _will_ write it to non-volatile (NV) storage (disk).

In the case of a sync write (at least as defined by POSIX), the App asks the FS to write some data and not to return until it is committed to NV storage. The FS is required (by POSIX) to _not_ acknowledge the write until the data _has_ been committed to NV storage.

So in the first case, the FS can accept the data, put it in its “write cache” (typically RAM), and respond to the App that the write is complete. When the FS has the time, it then commits the data to NV storage. If the system crashes after the App has “written” the data but before the FS has committed it to NV storage, that data is lost.

In the second case, the FS _must_not_ respond to the App until the data is committed to NV storage. The App can be certain that the data is safe. This is critical for, among other things, databases that must process transactions in a specific order or at a specific time.
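
To make that concrete, here is a minimal sketch from a FreeBSD shell (file and pool names are hypothetical; fsync(1) is the small base-system utility that calls fsync(2) on an existing file):

    # Async: cp returns as soon as the FS has the data in its write
    # cache; the bytes may still be in RAM only.
    cp bigfile /tank/data/bigfile

    # Block until the cached data is committed to NV storage; this is
    # the same guarantee a sync write gives at every write call.
    fsync /tank/data/bigfile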

> Can someone point me to a good "newbie's guide" that explains sync vs async from the ground up? One that makes no assumptions about prior knowledge of filesystems and I/O. And likewise, another guide specifically for how they relate to ZFS pool/vdev configuration?

I don’t know of a basic guide to this; I just learned it from various places over 20 years in the business.

In terms of ZFS, the ARC acts as both write buffer and read cache. You can see this easily when running benchmarks such as iozone with files smaller than the amount of RAM. When you make an async write call, the FS responds almost immediately, and you are really measuring the efficiency of the ZFS code and memory bandwidth :-) I have seen write performance in the tens of GB/sec on drives that I know do not have that kind of bandwidth. Make the ARC too small to hold the entire file, or make the file too big to fit, and you start seeing the performance of the drives. This is due (in part) to the TXG design of ZFS. You can watch the drives (via iostat -x) and see ZFS committing data in bursts (originally up to 30 seconds apart, now up to 5 seconds apart).
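
For example, you can watch those bursts yourself while a benchmark runs (the pool name and one-second interval are just examples):

    # Per-disk view: async writes hit the disks in periodic TXG bursts
    iostat -x 1

    # The same picture at the pool/vdev level
    zpool iostat -v tank 1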

Now when you issue a sync write to ZFS, in order to adhere to POSIX requirements, ZFS _must_ commit the data to NV storage before returning an acknowledgement to the App. So ZFS has the ZIL (ZFS Intent Log). All sync writes are committed to the ZIL immediately and then incorporated into the dataset itself as TXGs commit. The ZIL is just space stolen from the zpool _unless_ you have a Separate Log Device (SLOG), which is just a special type of vdev (like spare) and is listed as “log” in zpool status. A SLOG does two things: 1) ZFS no longer needs to steal space from the dataset for the ZIL, so the dataset will be much less fragmented, and 2) you can use a device which is much faster than the main zpool devices (like a ZeusRAM or fast SSD) and greatly speed up sync writes.
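
For example, adding a SLOG to an existing pool is a one-line operation (pool and device names here are hypothetical):

    # Attach a mirrored pair of fast SSDs as a separate log device
    zpool add tank log mirror ada4 ada5

    # The log vdev then shows up in its own section of the status output
    zpool status tank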

You can see the performance difference between async and sync using iozone with the -o option. From the iozone man page: "Writes are synchronously written to disk. (O_SYNC). Iozone will open the files with the O_SYNC flag. This forces all writes to the file to go completely to disk before returning to the benchmark."
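
As a rough sketch, a pair of runs like these (file size, record size, and path are just examples) will show the gap:

    # Async write test: mostly measures ZFS code paths and RAM bandwidth
    iozone -i 0 -s 4g -r 128k -f /tank/data/iozone.tmp

    # Same test with -o (O_SYNC): every write must reach NV storage first
    iozone -o -i 0 -s 4g -r 128k -f /tank/data/iozone.tmp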

I hope this gets you started …

--
Paul Kraus
pa...@kraus-haus.org

Quartz

Sep 24, 2015, 4:53:38 PM
to FreeBSD questions
> Very short answer…

OK, thanks. So far that lines up with what I thought I knew. I still
think I might be fuzzy on what constitutes an 'app' in this context
though; presumably you're also counting services like NFS, etc.?
Basically, when considering just boring file copies, which things are or
are not async and when? Under what circumstances is sync actually used
in the real world?


> you can
> use a device which is much faster than the main zpool devices

Also

1) A SLOG's only purpose is to reduce fragmentation and increase sync
speed, correct? Re: speed, using a SLOG that's the same speed as the
other drives in a pool is mostly pointless, right?

2) Async doesn't really care how your pool is constructed, and a SLOG is
really the only thing that seriously makes a difference for sync, correct?

Paul Kraus

Sep 24, 2015, 9:11:22 PM
to Quartz, FreeBSD questions
On Sep 24, 2015, at 16:53, Quartz <qua...@sneakertech.com> wrote:

>> Very short answer…
>
> OK, thanks. So far that lines up with what I thought I knew. I still think I might be fuzzy on what constitutes an 'app' in this context though; presumably you're also counting services like NFS, etc.?

Anything that generates an FS read or write request :-) So yes, the kernel NFS server counts.

> Basically, when considering just boring file copies, which things are or are not async and when? Under what circumstances is sync actually used in the real world?

I expect that system utilities like cp and tar do not do sync writes. Sync writes are supposed to be a special case, used only when needed. I run into them with VBox writing to <>.vmdk files.
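
If you want to check a given utility yourself, tracing its syscalls is a quick way to see whether it ever asks for sync semantics (paths here are just examples):

    # A plain cp should show no fsync(2) calls and no O_SYNC opens
    truss cp /tmp/src /tank/dst 2>&1 | egrep 'fsync|O_SYNC'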

>> you can
>> use a device which is much faster than the main zpool devices
>
> Also
>
> 1) A SLOG's only purpose is to reduce fragmentation and increase sync speed, correct? Re: speed, using a SLOG that's the same speed as the other drives in a pool is mostly pointless, right?

Correct. And I proved that on one of my servers in pre-production testing. I was able to find the bottleneck using iozone -o; when I then added a mirrored pair of SSDs as a SLOG, write performance went _down_ for 4 KB random writes! I then tested the SSDs on their own and confirmed that the performance I was seeing was the native performance of the SSDs. I asked for recommendations of a good, fast SSD over on the OpenZFS list and ordered a pair of Intel 200 GB S3710 SSDs; they are back-ordered, so the server awaits full production use.

> 2) Async doesn't really care how your pool is constructed, and a SLOG is really the only thing that seriously makes a difference for sync, correct?

Not quite true. Once you get through the ARC, the configuration of the zpool _will_ matter to performance. In fact, for reads, unless your workload closely matches the prefetch algorithm, the zpool layout will have an effect on performance. Remember, as a general rule, you get one spindle's (drive's) worth of performance per top-level vdev in the zpool. So a zpool with one vdev that is an 8-drive RAIDz2 will have much less performance than a set of four 2-way mirrors.
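
For example, the same eight drives laid out two different ways (device names are hypothetical):

    # One 8-drive RAIDz2 vdev: roughly one drive's worth of write IOPS
    zpool create tank raidz2 da0 da1 da2 da3 da4 da5 da6 da7

    # Four 2-way mirror vdevs: roughly four drives' worth of IOPS,
    # at lower usable capacity
    zpool create tank mirror da0 da1 mirror da2 da3 \
                     mirror da4 da5 mirror da6 da7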

--
Paul Kraus
pa...@kraus-haus.org

Quartz

Sep 24, 2015, 10:05:48 PM
to FreeBSD questions
> I expect that system utilities like cp and tar do not do sync writes.
> Sync writes are supposed to be a special case, used only when needed.
> I run into them with VBox writing to <>.vmdk files.

NFS forces sync though, doesn't it? What if you're cp-ing to a mounted
share? I'm not sure I totally understand how all this interacts.


>> 2) Async doesn't really care how your pool is constructed, and a
>> SLOG is really the only thing that seriously makes a difference for
>> sync, correct?
>
> Not quite true. Once you get through the ARC the configuration of the
> zpool _will_ matter to performance.

Maybe I worded that badly. What I meant was that whereas sync write
performance is strongly affected by a SLOG, async writes have no special
considerations of their own that don't also affect sync, right?

Quartz

Sep 24, 2015, 10:10:27 PM
to FreeBSD questions
>> 1) A SLOG's only purpose is to reduce fragmentation and increase
>> sync speed, correct? Re: speed, using a SLOG that's the same speed
>> as the other drives in a pool is mostly pointless, right?
>
> Correct. And I proved that on one of my servers in pre-prodcuction
> testing.

Tack-on question: would an identical-speed SLOG still speed up the pool
by proxy simply by reducing IO load on the vdev(s) and/or reducing head
travel on the drives?

Paul Kraus

Sep 24, 2015, 10:49:52 PM
to Quartz, FreeBSD questions
On Sep 24, 2015, at 22:05, Quartz <qua...@sneakertech.com> wrote:

>> I expect that system utilities like cp and tar do not do sync writes.
>> Sync writes are supposed to be a special case, used only when needed.
>> I run into them with VBox writing to <>.vmdk files.
>
> NFS forces sync though, doesn't it? What if you're cp-ing to a mounted share? I'm not sure I totally understand how all this interacts.

Reading the NFS ver 3 RFC (1813, available here: https://tools.ietf.org/html/rfc1813; look for the descriptions of the WRITE and COMMIT procedures), it looks like NFS ver 2 was sync only, while ver 3 added the ability (at the protocol layer) to require sync data, sync data + metadata, or async behavior. From some Linux NFS notes I found, it appears that the default Linux behavior is sync, but you can disable sync on the server side (as you can with ZFS), in which case the NFS server does not follow the protocol specification.

From the FreeBSD man page options for mount_nfs:
wcommitsize=<value>
             Set the maximum pending write commit size to the specified
             value. This determines the maximum amount of pending write
             data that the NFS client is willing to cache for each file.

This implies that the FreeBSD NFS client fully implements the ver 3 protocol, which permits the client to cache data until an fsync call is made or the amount of data in the client cache reaches “wcommitsize”. Essentially, async behavior.
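
As an example of that client-side tunable (server name, export path, and size are hypothetical):

    # NFSv3 mount that lets the client cache up to 16 MB of
    # uncommitted writes per file before forcing a commit
    mount -t nfs -o nfsv3,wcommitsize=16777216 server:/export /mnt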

It does not look like (from reading the exports man page) you can override the handling of sync data, sync data + metadata, or async requests on the NFS server side. In other words, the FreeBSD NFS server does what the client instructs, as it should in order to comply with the NFS ver 3 specification in the RFC.
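
For completeness, the ZFS server-side override mentioned earlier is a per-dataset property (dataset name is hypothetical); like the Linux option, it breaks the protocol's durability guarantee, so use it with care:

    # Treat all sync writes to this dataset as async
    zfs set sync=disabled tank/export
    zfs get sync tank/export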

>>> 2) Async doesn't really care how your pool is constructed, and a
>>> SLOG is really the only thing that seriously makes a difference for
>>> sync, correct?
>>
>> Not quite true. Once you get through the ARC the configuration of the
>> zpool _will_ matter to performance.
>
> Maybe I worded that badly. What I meant was that whereas sync write performance is strongly affected by a SLOG, async writes have no special considerations of their own that don't also affect sync, right?

Correct. Any configuration change that affects async performance will affect sync performance as well. While the ZIL/SLOG can affect sync performance, they are not involved in async write operations at all, so they cannot have any effect on async writes.

Although, I suppose you _could_ say that an increase in the amount of RAM available for the ARC may increase async write performance; such an increase in ARC would have little to no effect on sync writes. While sync writes _do_ go through the ARC, the ZIL/SLOG insulates the App writing the data from that.

--
Paul Kraus
pa...@kraus-haus.org