ashift and 4k disk alignment... Is ZFS cancelling out our partitioning efforts?

Daniel Smedegaard Buus

Jul 20, 2011, 8:30:40 AM
to zfs-fuse
Hello :)

Stumbled upon this by accident while reading up on stuff regarding ZFS
on Linux (on a side note, progress seems to be massive over there):
http://www.solarismen.de/archives/5-Solaris-and-the-new-4K-Sector-Disks-e.g.-WDxxEARS-Part-2.html

What exactly does this mean to us owners of 4k drives? Does this mean
that performance with ZFS on these drives is the same (as in sub-
optimal) regardless of whether our partitions are properly or
improperly aligned? Or does it mean that we (owners of 4k drives)
should actually do both - that is, properly align our partitions AND
modify ZFS to use this similar 4k alignment internally? Or is it all
just FUD?

Please, some thoughts from those of you more informed than me :)

Cheers,
Daniel :)

sgheeren

Jul 20, 2011, 8:50:20 AM
to zfs-...@googlegroups.com
On 07/20/2011 02:30 PM, Daniel Smedegaard Buus wrote:
> Hello :)
>
> Stumbled upon this by accident while reading up on stuff regarding ZFS
> on Linux (on a side note, progress seems to be massive over there):
> http://www.solarismen.de/archives/5-Solaris-and-the-new-4K-Sector-Disks-e.g.-WDxxEARS-Part-2.html
Also, see this
https://github.com/zfsonlinux/zfs/issues/289#issuecomment-1495456
in relation to SSDs and erase blocks. It is not over yet, ashift=12
might not be enough for you.
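For the record, ashift is just a power-of-two exponent: the vdev's minimum block size is 2^ashift bytes. A quick sketch of the numbers involved (the 512 KiB erase block is only a typical example, not a statement about any particular SSD; check your drive's spec):

```shell
# ashift is an exponent: minimum block size = 2^ashift bytes.
echo $((1 << 9))      # ashift=9  -> 512-byte blocks (the old default)
echo $((1 << 12))     # ashift=12 -> 4096-byte blocks (4k drives)
# A typical SSD erase block is far larger still, e.g. 512 KiB:
echo $((512 * 1024))  # -> 524288 bytes
```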

> What exactly does this mean to us owners of 4k drives? Does this mean
> that performance with ZFS on these drives is the same (as in sub-
> optimal) regardless of whether our partitions are properly or
> improperly aligned? Or does it mean that we (owners of 4k drives)
> should actually do both - that is, properly align our partitions AND
> modify ZFS to use this similar 4k alignment internally? Or is it all
> just FUD?
It's not FUD. However, I have no data on it, other than that my Solaris
nv147 pool runs nicely on whole disks with EADS as well as EARS drives (I
purposely mixed the mirrored pools so as to avoid wearing both disks out
at around the same time). zdb -C tells me all three pools are ashift=9...

Manuel Amador (Rudd-O)

Jul 20, 2011, 3:52:39 PM
to zfs-...@googlegroups.com
Do BOTH. I have those disks, and write performance is halved with
misaligned partitions, even with ashift=12. The good news is that with
today's fdisk, even if the disk reports the wrong sector size, you have
to PURPOSEFULLY misalign your partitions to a non-multiple of 4K to get
it wrong. In the olden days the first partition started at sector 63;
now it starts at 2048.
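The alignment arithmetic here is easy to check for yourself: a partition start is 4k-aligned when its byte offset is a multiple of 4096 (this sketch assumes 512-byte logical sectors, as on the EARS drives):

```shell
# start_sector * 512 must be a multiple of 4096 for 4k alignment.
echo $(( (63   * 512) % 4096 ))   # old fdisk default, sector 63:  3584 -> misaligned
echo $(( (2048 * 512) % 4096 ))   # modern default, sector 2048:   0    -> aligned
```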

Manuel Amador (Rudd-O)

Jul 20, 2011, 3:54:28 PM
to zfs-...@googlegroups.com
With SSDs, it's all over the place. Using ashift=12 helps a bit, but
what helps MUCH MORE is aligning the starting sector to the erase block
size. We're talking write performance going from 25 to 90 MB/s, with a
similar boost in read performance.

I know this because I tested it. I posted a data.ods file on zfsonlinux
earlier, with my measurements from when I was redoing my laptop SSD.
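The same modular arithmetic used for 4k alignment works for erase blocks; just swap in the erase block size. A sketch assuming a hypothetical 512 KiB erase block and 512-byte logical sectors:

```shell
EB=$((512 * 1024))          # assumed erase block size: 512 KiB
SECTORS=$((EB / 512))       # erase block expressed in sectors (1024)
echo $(( 2048 % SECTORS ))  # start sector 2048: 0  -> erase-block aligned
echo $(( 63   % SECTORS ))  # start sector 63:   63 -> misaligned
```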

Daniel Smedegaard Buus

Jul 21, 2011, 7:40:16 AM
to zfs-fuse
Thanks, both of you. It feels terrible to discover this just after
having created a new 19-drive 38 TB RAIDZ-3 pool and mirrored back 10 TB
of data :( I was very careful to create GPTs with proper 4k alignment
(all drives are 4k, most of them EADS/EARS), and now this... At least
it seems the performance penalty with ashift=9 isn't necessarily that
massive:

http://digitaldj.net/2010/11/03/zfs-zpool-v28-openindiana-b147-4k-drives-and-you/

Would be nice to see some more numbers, though.

What do you think about the possibility of a future zfs revision
offering an "ashift upgrade" option? I'm guessing slim to no chance in
hell :)

Daniel Smedegaard Buus

Jul 21, 2011, 7:42:17 AM
to zfs-fuse
One more thing: If I were to use a "hacked" zfs build with ashift=12
hardcoded into it, could I just continue using the pool with "normal"
builds such as the official zfs-fuse build and native FreeBSD version?
Or is that "undetermined"?

Manuel Amador (Rudd-O)

Jul 21, 2011, 7:06:47 PM
to zfs-...@googlegroups.com
It won't work. You'll likely corrupt the data if you force the ashift
12 in the internal data structures.

sgheeren

Jul 22, 2011, 4:46:47 AM
to zfs-...@googlegroups.com
On 07/22/2011 01:06 AM, Manuel Amador (Rudd-O) wrote:
> It won't work. You'll likely corrupt the data if you force the ashift
> 12 in the internal data structures.

What other change is necessary? Where does that change come from
(upstream onnv-gate latest version or LLNL?)

In the case of upstream, it should already have been merged into
unstable and perhaps testing. I'd like to check, though, so any
specifics would be welcome.

sgheeren

Jul 22, 2011, 4:53:50 AM
to zfs-...@googlegroups.com
On 07/22/2011 10:46 AM, sgheeren wrote:
> On 07/22/2011 01:06 AM, Manuel Amador (Rudd-O) wrote:
>> It won't work. You'll likely corrupt the data if you force the ashift
>> 12 in the internal data structures.
> What other change is necessary? Where does that change come from
> (upstream onnv-gate latest version or LLNL?)
>
> In the case of upstream, it should already have been merged into
> unstable and perhaps testing.
For example, this is what we have now:

git log -Sashift unstable
commit af19acde5f7cd5791d158012bcef1f4aace4ef73
Author: Victor Latushkin <Victor.L...@Sun.COM>
Date:   Sun Feb 21 22:58:08 2010 +0100

    hg commit 11725:6720637 want zdb -l option to dump uberblock arrays as well

commit db2f633064b5b229ddc26b1003dadff3dbfcab85
Author: Mark J Musante <Mark.M...@Sun.COM>
Date:   Wed Feb 17 15:19:58 2010 +0100

    hg commit 11422:PSARC/2009/511 zpool split
    5097228 provide 'zpool split' to create new pool by breaking all mirrors
    6880831 memory leak in zpool add
    6891438 zfs_ioc_userspace_upgrade could reference uninitialised error variable
    6891441 zvol_create_minor sets local variable zv but never references it
    6891442 spa_import() sets local variable spa but never references it
    6895446 vdevs left open after removing slogs or offlining device/file

commit 5cdd8cf8067a48b121e39a6a1766238bfa8b98b2
Author: Jeff Bonwick <Jeff.B...@Sun.COM>
Date:   Tue Nov 10 15:02:11 2009 +0100

    hg commit 10922:PSARC 2009/571 ZFS Deduplication Properties
    6677093 zfs should have dedup capability

commit c8e9062d8679f9a30fbdb826ac7d9f8857f35e06
Author: Adam Leventhal <adam.le...@sun.com>
Date:   Wed Nov 4 13:55:58 2009 +0100

    hg commit 10105:6854612 triple-parity RAID-Z
    6854621 RAID-Z should mind the gap on writes


Fajar A. Nugraha

Jul 22, 2011, 5:03:14 AM
to zfs-...@googlegroups.com

It depends on how you define '"hacked" zfs build with ashift=12
hardcoded into it'.

If, like Rudd-O implies, you start with an existing pool and edit the
on disk data to somehow force ashift=12, then most likely it would
lead to corruption.

However, if you mean "some implementation of zfs/zpool that can force
the use of ashift=12 at pool creation time", then the resulting pool
should be accessible in other implementations.

For example, zfsonlinux has ashift as a zpool create option, which you
could easily use to create a pool with ashift=12. In (open)solaris you'd
need a workaround: use a device which reports a 4k sector size as the
top-level vdev (iscsi is easiest); see
http://www.mail-archive.com/zfs-d...@opensolaris.org/msg46498.html
or http://opensolaris.org/jive/thread.jspa?threadID=139316 for
examples. The resulting pool (whichever implementation/method was used
to create it) should be readable on other implementations (as long as
they're capable of reading the pool version).
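Concretely, the two commands involved look something like this (a sketch only; the pool name, layout, and device paths are made up, and this requires root plus a zfsonlinux build with the ashift option):

```shell
# zfsonlinux: force 4k blocks at pool creation time
zpool create -o ashift=12 tank mirror /dev/sdb /dev/sdc

# verify afterwards (works on any implementation)
zdb -C tank | grep ashift
```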

--
Fajar

sgheeren

Jul 22, 2011, 6:10:14 AM
to zfs-...@googlegroups.com
On 07/22/2011 11:03 AM, Fajar A. Nugraha wrote:
> If, like Rudd-O implies, you start with an existing pool and edit the
> on disk data to somehow force ashift=12, then most likely it would
> lead to corruption.
Ok, I get it. I'm not sure that he was implying such shotgun surgery,
but good points on the consequences, of course!

Daniel Smedegaard Buus

Jul 23, 2011, 4:17:07 AM
to zfs-fuse
Sorry for the confusion, guys. What I meant was creating a fresh pool
with ashift=12 using a hacked binary, then throwing away the binary and
installing more generic ones, and whether or not that would be possible.
It seems it would, thanks :)

I think the confusion stems from my having added the other question:
whether in time it's plausible, or even possible, that a future set of
utils would be able to "upgrade" a sub-optimal pool configuration from
ashift=9 to ashift=12. That's the question that haunts me the most,
because the answer might mean I'll have to re-create my newly created
pool once more, and that's a pretty darn time-consuming process :)

Thanks for all your help!

Christ Schlacta

Jul 23, 2011, 5:50:01 AM
to zfs-...@googlegroups.com
zfs send | zfs recv isn't so bad; you can even use -R. If you have
enough drive space at your disposal to simply do it, excellent; if not,
you'll need to find a way to store the data while you do the replace.
Either way, it's quite simple, and a "set it and forget it" process.
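For anyone facing the same rebuild, that process boils down to something like this (a sketch; "tank", "newtank", and the snapshot name are made up, and newtank would be the freshly created ashift=12 pool):

```shell
zfs snapshot -r tank@migrate                        # recursive snapshot of the old pool
zfs send -R tank@migrate | zfs recv -F -d newtank   # replicate all datasets and properties across
```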