
Best practices for ZFS setup for a strictly SSD based system?


Patrick M. Hausen

Feb 9, 2016, 11:14:33 AM
Hi, all,

while there is quite a bit of documentation on how to improve ZFS performance
by using a combination of rotating disks and SSDs, I have not found much about
an SSD only setup.

We are planning to try a hosting server with 8 SATA SSDs with ZFS. Things I am
not at all sure about:

* Does the recommended limit of 6 disks for a RAIDZ2 still
hold? 2x 4 disks is quite a bit of overhead, could I use all 8
in one vdev and get away with it?
(The maximum of 6 recommendation is in some old Sun doc)

* Will e.g. MySQL still profit from residing on a mirror
instead of a RAIDZ2, even if all disks are SSDs?

* Does a separate ZIL and/or ARC cache device still
make sense?

Any pointers or direct help greatly appreciated. Or should I take this to freebsd-fs@?

Thanks and best regards,
Patrick
--
punkt.de GmbH * Kaiserallee 13a * 76133 Karlsruhe
Tel. 0721 9109 0 * Fax 0721 9109 100
in...@punkt.de http://www.punkt.de
Gf: Jürgen Egeling AG Mannheim 108285


Alan Somers

Feb 9, 2016, 11:35:53 AM
On Tue, Feb 9, 2016 at 8:54 AM, Patrick M. Hausen <hau...@punkt.de> wrote:
> Hi, all,
>
> while there is quite a bit of documentation on how to improve ZFS performance
> by using a combination of rotating disks and SSDs, I have not found much about
> an SSD only setup.
>
> We are planning to try a hosting server with 8 SATA SSDs with ZFS. Things I am
> not at all sure about:
>
> * Does the recommended limit of 6 disks for a RAIDZ2 still
> hold? 2x 4 disks is quite a bit of overhead, could I use all 8
> in one vdev and get away with it?
> (The maximum of 6 recommendation is in some old Sun doc)

Nah, you can go much higher. The post below describes the RAIDZ
overhead. The main penalty of larger stripes is lower IOPS. Your RAIDZ
array will have the same read IOPS as a single SSD, no matter how large
it is. So, for example, a pool of two RAIDZ stripes each containing 4+2
disks will have twice the IOPS of a pool containing one RAIDZ stripe
with 8+2 disks, and about the same storage overhead.

http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/
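
For illustration, the two layouts could be created roughly like this
(pool and device names are hypothetical):

    # two RAIDZ2 vdevs of 4+2 disks each (12 disks total)
    zpool create tank raidz2 da0 da1 da2 da3 da4 da5 \
                      raidz2 da6 da7 da8 da9 da10 da11

    # one RAIDZ2 vdev of 8+2 disks (10 disks total)
    zpool create tank raidz2 da0 da1 da2 da3 da4 \
                             da5 da6 da7 da8 da9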

>
> * Will e.g. MySQL still profit from residing on a mirror
> instead of a RAIDZ2, even if all disks are SSDs?

Yes, because a mirrored vdev has as many read IOPS as all of its disks
combined. So a RAID10 of SSDs will have many read IOPS indeed.
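
With your 8 SSDs that would be four two-way mirrors, something like
this (device names hypothetical):

    # each mirror vdev adds its own read IOPS to the pool
    zpool create tank mirror da0 da1 mirror da2 da3 \
                      mirror da4 da5 mirror da6 da7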

>
> * Does a separate ZIL and/or ARC cache device still
> make sense?

Usually no. But it might make a difference if the ZIL or L2ARC
devices have different characteristics from the regular devices. For
example, you might use medium speed MLC flash for your regular vdevs
and a very fast, small SLC device for the ZIL. But I wouldn't do it
unless you thoroughly test it with your workload.
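
If you do test it, adding the devices later is straightforward (device
names hypothetical):

    zpool add tank log nvd0      # separate intent log (SLOG)
    zpool add tank cache nvd1    # L2ARC read cache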

>
> Any pointers or direct help greatly appreciated. Or should I take this to freebsd-fs@?

Will MySQL access its files in fixed-size records? If so, you can set
the recordsize filesystem property accordingly. If not, you should
probably leave recordsize at the default. If you profile MySQL's disk
accesses and determine that there is a dominant record size, then go
ahead and set ZFS's recordsize to the next highest power of two.

As usual, disable atime.
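
For example, for InnoDB's default 16k page size (dataset name
hypothetical):

    zfs set recordsize=16k tank/mysql
    zfs set atime=off tank/mysql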

>
> Thanks and best regards,
> Patrick
> --

-Alan

Patrick M. Hausen

Feb 9, 2016, 12:05:50 PM
Hi!

> On 09.02.2016 at 17:32, Alan Somers <aso...@freebsd.org> wrote:
> [...]


> http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/
>
>>
>> * Will e.g. MySQL still profit from residing on a mirror
>> instead of a RAIDZ2, even if all disks are SSDs?
>
> Yes, because a mirrored vdev has as many read IOPS as all of its disks
> combined. So a RAID10 of SSDs will have many read IOPS indeed.

Ah … yes. Now I remember :)

> […]


> Will MySQL access its files in fixed-size records? If so, you can set
> the recordsize filesystem property accordingly. If not, you should
> probably leave recordsize at the default. If you profile MySQL's disk
> accesses and determine that there is a dominant record size, then go
> ahead and set ZFS's recordsize to the next highest power of two.
>
> As usual, disable atime.

We already knew about those. But thanks a lot for the vdev setup
hints! So it will be a mirror for OS and DB and a 4+2 RAIDZ2
for the rest of the data.

Our MySQL datasets are currently set up like this:

DB files:
recordsize=16k
atime=off
primarycache=metadata

InnoDB log files:
recordsize=128k
(rest inherited from above)
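
For reference, those were created more or less like this (pool and
dataset names shortened):

    zfs create -o recordsize=16k -o atime=off \
        -o primarycache=metadata tank/mysql
    zfs create -o recordsize=128k tank/mysql/innodb-logs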

Kind regards,


Patrick
--
punkt.de GmbH * Kaiserallee 13a * 76133 Karlsruhe
Tel. 0721 9109 0 * Fax 0721 9109 100
in...@punkt.de http://www.punkt.de
Gf: Jürgen Egeling AG Mannheim 108285


Jan Bramkamp

Feb 9, 2016, 12:28:51 PM


On 09/02/16 16:54, Patrick M. Hausen wrote:
> Hi, all,
>
> while there is quite a bit of documentation on how to improve ZFS performance
> by using a combination of rotating disks and SSDs, I have not found much about
> an SSD only setup.
>
> We are planning to try a hosting server with 8 SATA SSDs with ZFS. Things I am
> not at all sure about:
>
> * Does the recommended limit of 6 disks for a RAIDZ2 still
> hold? 2x 4 disks is quite a bit of overhead, could I use all 8
> in one vdev and get away with it?
> (The maximum of 6 recommendation is in some old Sun doc)

There are multiple reasons to limit the number of disks per RAID-Z vdev.

* Resilver time: ZFS has to process all objects ordered by transaction
id to resilver a RAID-Z. Resilvering is a torture test for the remaining
disks of your degraded RAID-Z, and with the ratio of bandwidth to
capacity of current hard disks, resilvering takes too long. This isn't
an issue for SSDs.

* For performance estimations, think of the RAID-Z as one huge disk
with larger blocks but the same IOPS as the slowest disk in the RAID-Z.
Databases perform disk I/O in small blocks, limiting your RAID-Z to the
performance of about one of its member disks.

* A ZFS pool can only grow by adding whole vdevs or by replacing all
disks in a vdev one at a time. Using mirrors allows the pool to grow in
smaller increments.

> * Will e.g. MySQL still profit from residing on a mirror
> instead of a RAIDZ2, even if all disks are SSDs?

Yes. OpenZFS schedules reads on mirrors to the disk with the shortest
queue, so a mirror offers about the sum of its member disks in read
performance (IOPS and bandwidth) and the minimum of its member disks in
write performance (IOPS and bandwidth). A pool with as many mirrored
vdevs as possible will offer the optimal performance for a given number
of disks. For write-heavy workloads the quality of the SSDs matters a
lot as well. Cheap consumer SSDs can't sustain high write rates for any
length of time. Even medium-quality SSDs have a lot of jitter and suffer
from throughput degradation under sustained write loads. Optimized
server SSDs can sustain random write workloads with little jitter and
bounded latency.

An NVMe SSD can offer an additional order of magnitude performance
increase over SATA SSDs, but at a significant increase in price. With
multiple NVMe SSDs you will run into the current scalability limits of
ZFS and GEOM.

> * Does a separate ZIL and/or ARC cache device still
> make sense?

Most likely not.



Another optimization is splitting the log and table space and creating
a dedicated ZFS dataset for each. Create the dataset containing the
table space with the fixed record size of your MySQL backend. ZFS also
offers far stronger consistency and atomicity guarantees than required
of a minimal POSIX file system. This allows you to further reduce the
syncing overhead by tuning MySQL to take advantage of ZFS's guarantees,
as sketched below.
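
A commonly cited example, assuming InnoDB (dataset names and record
sizes hypothetical): since ZFS never exposes torn pages, InnoDB's own
torn-page protection can be switched off.

    # one dataset per role, record size matched to the workload
    zfs create -o recordsize=16k tank/mysql/data
    zfs create -o recordsize=128k tank/mysql/log

    # my.cnf: skip the doublewrite buffer; ZFS already provides
    # the atomicity it exists to work around
    [mysqld]
    innodb_doublewrite = 0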

krad

Feb 10, 2016, 5:15:43 AM
Don't forget alignment and ashift. You may also want to test compression
as well. If you have spare CPU cycles, I would imagine the system's CPU
will handle it faster than any onboard SSD compression. Benchmarking
would be of use here, though.
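
On FreeBSD the minimum ashift for newly created vdevs can be forced
via sysctl, and compression is a per-dataset property (pool name
hypothetical):

    sysctl vfs.zfs.min_auto_ashift=12   # 4K sectors for new vdevs
    zfs set compression=lz4 tank        # cheap on modern CPUs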

Patrick M. Hausen

Feb 11, 2016, 2:58:44 AM
Hi, all,

> On 10.02.2016 at 11:15, krad <kra...@gmail.com> wrote:
>
> Don't forget alignment and ashift. You may also want to test compression as well. If you have spare CPU cycles, I would imagine the system's CPU will handle it faster than any onboard SSD compression. Benchmarking would be of use here, though.

Correct. Just for the record: since 10.2 the FreeBSD installer does the right thing [tm].

ashift=12 and partitions are 1M aligned.
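
For manual setups, the equivalent would be something like this (disk,
label, and pool names hypothetical):

    gpart add -t freebsd-zfs -a 1m -l ssd0 da0   # 1M-aligned partition
    zdb -C tank | grep ashift                    # verify ashift: 12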
