Huzzah! zfs-fuse on dual SSD screaming!


sghe...@hotmail.com

Oct 8, 2009, 11:47:47 AM10/8/09
to zfs-...@googlegroups.com
I'm so happy. I just received and installed two Corsair Extreme 64GB SSDs (rated 240/135 MB/s raw read/write by the manufacturer).

Guess what... first thing I did:

------
PS: Worth noting I'm on the OS and package versions stated below.
Anyone have experience tuning ZFS for SSD? (The kind of per-dataset knobs I mean are sketched right below.)
------
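The sketch, for reference: plain ZFS properties, untested by me so far, and assuming the zfs-fuse build exposes them all:

$ zfs set atime=off ssd          # skip access-time updates; saves writes on flash
$ zfs set compression=on ssd     # trade CPU for fewer bytes written to the cells
$ zfs set recordsize=128k ssd    # 128k is the default; worth varying per workload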

$ zpool create ssd /dev/disk/by-id/ata-Corsair_CMFSSD64-D1_[ST]*
$ zpool status
  pool: ssd
 state: ONLINE
 scrub: none requested
config:

    NAME                                                       STATE     READ WRITE CKSUM
    ssd                                                        ONLINE       0     0     0
      disk/by-id/ata-Corsair_CMFSSD64-D1_S92O8T9T1F66W3MZ3O11  ONLINE       0     0     0
      disk/by-id/ata-Corsair_CMFSSD64-D1_T30O5KI2747KJJ7DZG91  ONLINE       0     0     0

errors: No known data errors

$ zfs set compression=on ssd
$ cd /tmp # my tmp is on tmpfs

$ # sorry, didn't time:
$ dd if=/dev/urandom of=blob.bin bs=1m count=1024

$ # now for the fun:
$ pv blob.bin > /ssd/fast


Guess what: the steady write throughput (synthetic test) yields 68MB/s _on average_. Using zpool iostat I've spotted a peak of 91MB/s (whoah).
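(Aside: the per-vdev peaks are easiest to spot by watching the pool live; standard zpool usage, e.g.:)

$ zpool iostat -v ssd 1     # per-vdev bandwidth, refreshed every second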

This is totally acceptable for me! Needless to say, I'll run a configuration on ext4/BIOS RAID0 and a setup on NILFS (dual linear SATA mode) for comparison before I make up my mind, but this is no longer a showstopper for me in terms of performance.

Now for some even less representative fun:

root@intrepid:/tmp/work# pv /ssd/fast | md5sum
   1GB 0:00:03 [ 260MB/s] [=============>] 100%           
f4a8bb6266eed950e8c76163fa092a3b  -

root@intrepid:/tmp/work# echo 1 > /proc/sys/vm/drop_caches
root@intrepid:/tmp/work# pv /ssd/fast | md5sum
   1GB 0:00:05 [ 184MB/s] [============>] 100%           
f4a8bb6266eed950e8c76163fa092a3b  -


Not bad AT ALL (sorry for shouting)!

VERSIONS USED:
root@intrepid:/tmp/work# dpkg --status zfs-fuse
Package: zfs-fuse
Status: install ok installed
Priority: optional
Section: multiverse/admin
Installed-Size: 4160
Maintainer: Filip Brcic <br...@gna.org>
Architecture: i386
Version: 0.5.1-1ubuntu5


root@intrepid:/tmp/work# lsb_release -a
No LSB modules are available.
Distributor ID:    Ubuntu
Description:    Ubuntu 9.04
Release:    9.04
Codename:    jaunty
root@intrepid:/tmp/work# uname -a
Linux intrepid.sehe.nl 2.6.28-15-server #52-Ubuntu SMP Wed Sep 9 11:50:50 UTC 2009 i686 GNU/Linux

Yes, I know my hostname is a misnomer.
Note I'm using a server kernel image because it will access 8GB of RAM without tweaking.

Seth

sghe...@hotmail.com

Oct 8, 2009, 5:01:36 PM10/8/09
to zfs-...@googlegroups.com
If anyone notices that bs=1m is illegal for GNU dd... you are right. I actually used sdd, but nobody uses it, and neither should I, since its speed/progress reporting sucks. pv (pipe viewer) to the rescue!
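For the record, the GNU dd equivalent just wants an upper-case suffix, and pv can sit in the generating pipe as well; a sketch:

$ dd if=/dev/urandom of=blob.bin bs=1M count=1024    # GNU dd spells it 1M, not 1m
$ head -c 1G /dev/urandom | pv > blob.bin            # or let pv do the progress reporting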

By the way, I ran another test with compression off:

root@intrepid:/tmp# pv blob.bin > /ssd/faster
   1GB 0:00:14 [72.1MB/s] [====>] 100%           
root@intrepid:/tmp# pv blob.bin > /ssd/nocompress/fast
   1GB 0:00:16 [63.5MB/s] [====>] 100%           
root@intrepid:/tmp# zfs list -o name,compression,compressratio
NAME            COMPRESS  RATIO
ssd                   on  1.00x
ssd/nocompress       off  1.00x

Obviously pure randomness should be incompressible. However, the apparent timing difference did surprise me a little. It could be entirely within the margin of measurement error...
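A quick ZFS-independent sanity check of the 'incompressible' claim, just as an idea:

$ gzip -c /tmp/blob.bin | wc -c    # random data should come out slightly *larger* than the input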

Jonathan Schmidt

Oct 8, 2009, 5:09:05 PM10/8/09
to zfs-...@googlegroups.com
Hmm yeah, that's weird. I could understand compression=on speeding
things up if the data is actually compressible, but I think you are just
looking at the noise. Or maybe your SSD performance is degrading, as
you are watching! (just kidding, kinda)


sghe...@hotmail.com

Oct 8, 2009, 5:42:16 PM10/8/09
to zfs-...@googlegroups.com
Jonathan Schmidt wrote:
> Hmm yeah, that's weird. I could understand compression=on speeding
> things up if the data is actually compressible, but I think you are
> just looking at the noise
/dev/urandom is pretty much required to be noise...

> Or maybe your SSD performance is degrading, as you are watching!
> (just kidding, kinda)
Cynical joke much appreciated. Obviously I carefully selected my SSDs
with this kind of behaviour in mind, and I'm currently tuning and reading
up on exactly how to set up my rig for maximum pleasure/flexibility.

I'll keep you posted

sghe...@hotmail.com

Oct 8, 2009, 6:27:35 PM10/8/09
to zfs-...@googlegroups.com
Some non-ZFS numbers for comparison, using lvm2 backing and comparing the performance of striped (128k stripes) vs. linear volumes.

The fs selected was ext2 for the sake of simplicity and _reference_ value. I formatted both the linear and the striped volume with a stripe-width of 32 blocks (bs=4096), so the comparison will not be skewed.

It turns out the write performance of ext2 on a linear SSD volume is roughly 1.5x that of striped zfs-fuse. Not a nice number, but nothing unexpected. After all, I was hoping SSD would make zfs-fuse performance _acceptable_ for me.

Obviously, striped ext2 write performance is totally out of this world. Let's see... that is roughly 3 times as fast as striped zfs-fuse. Strangely enough it seems to be a little _more_ than twice the throughput of writing to a lone SSD, but again that seems to be within the error margin here.

The read performance is where ZFS shines! ZFS across two SSDs was quicker in all cases, even compared to plain and simple striped ext2! I'm sure this has to do with the prefetch algorithm in ZFS, which ext2 lacks? As could be gleaned from the first, shouting post, the read throughput of ZFS (~184MB/s) is already approaching (broadly speaking) the speed at which a single-threaded MD5 summer can even process the data (which still clocked in at ~260MB/s).

Note that this is all about synthetic tests doing large sustained reads/writes. I've seen some of the most pathetic performance of zfs-fuse when doing many small reads... [2]

# checked: /tmp/largeblob is positively cached/buffered, and on tmpfs anyway ...
# write to linear volume
root@intrepid:~/SSDTEST# pv /tmp/largeblob > linear/fast
1.31GB 0:00:13 [97.1MB/s]

# write to striped volume
root@intrepid:~/SSDTEST# pv /tmp/largeblob > striped/fast
1.31GB 0:00:06 [ 199MB/s]

# read from linear volume
root@intrepid:~/SSDTEST# echo 1 > /proc/sys/vm/drop_caches
root@intrepid:~/SSDTEST# pv linear/fast | md5sum
1.31GB 0:00:11 [ 116MB/s]
bdb210cf6a38d7df726d759b50853afa  -

# read from striped volume
root@intrepid:~/SSDTEST# echo 1 > /proc/sys/vm/drop_caches
root@intrepid:~/SSDTEST# pv striped/fast | md5sum
1.31GB 0:00:09 [ 148MB/s]
bdb210cf6a38d7df726d759b50853afa  -


The setup was made as follows:

# (obviously: zpool destroy ssd [1])
# mark the ssd disks for lvm2 now:
$ pvcreate /dev/sd[bd]
$ vgcreate ssd /dev/sd[bd]
$ lvcreate -n linear -L 32g ssd
$ lvcreate -i2 -I128 -n striped -L 32g ssd
$ mkfs.ext2 -E stripe-width=32 -b 4096 -L ssd_lin /dev/ssd/linear
$ mkfs.ext2 -E stripe-width=32 -b 4096 -L ssd_stripe /dev/ssd/striped
$ mkdir -p SSDTEST/{striped,linear}
$ cd SSDTEST/
$ mount LABEL=ssd_lin -o noatime linear/
$ mount LABEL=ssd_stripe -o noatime striped/


[1] I'm so happy computer programs will always interpret their input _formally_!
[2] Anyone ever tried running /home on ZFS and experienced a near standstill when running .wine programs? Well, the problem lies in the middle here (technical design compromises in wine), but it was sure painful to see just how pathological the performance of zfs-fuse became compared to just about any other Linux FS.
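For [2], a crude way to provoke the many-small-reads case, should anyone want to reproduce it (a sketch only; point it at whatever fs is under test):

root@intrepid:~/SSDTEST# mkdir striped/smallfiles
root@intrepid:~/SSDTEST# for i in $(seq 1 10000); do head -c 4096 /dev/urandom > striped/smallfiles/f$i; done
root@intrepid:~/SSDTEST# echo 1 > /proc/sys/vm/drop_caches
root@intrepid:~/SSDTEST# time cat striped/smallfiles/f* > /dev/null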

sghe...@hotmail.com

Oct 9, 2009, 9:51:55 AM10/9/09
to zfs-...@googlegroups.com
I repeated some of the striping performance tests using onboard Intel (fake)raid instead of lvm striping. I.e., I created a zfs pool on the BIOS raidset (single vdev) and a linear lvm volume on the same disk. I kept the stripe size at the same recommended 128k, and the fs for lvm at ext2/4k blocks/32-block stripe-width (see before). [1] Raw figures below.

CONCLUSIONS:
Biggest observation: I may have measured the max throughput of md5sum on my CPU incorrectly... I switched the read test to > /dev/null instead of | md5sum. To my horror/surprise I got some read speeds far in excess of my md5sum speed... eeck. It seems I'll have to check the accuracy of my previously reported read speeds...  :(

It appears that write performance on ZFS suffers a bit compared to letting zfs-fuse manage the striping internally: a drop from ~68 to ~57 MB/s. The (tweaked) ext2 approach suffers _more_, dropping from ~199 to ~151 MB/s. The takeaway from this test is that I'll probably end up ditching fake-raid, since it loses both performance and management flexibility. The only upside would be being able to dual-boot a Windows install on it in the future. Nice, but not enough to make these sacrifices for...

On the read side of things I'm a bit confused now (see the md5sum observation above). The read speed of ZFS on fakeraid might be anything from _twice as fast_ (sic?!) to _slower_ by some unknown amount. On the ext2 side the same thing goes. I can't explain ext2 on fakeraid0 being faster at ~264 vs ~148 MB/s compared to lvm striping with the same specs... I'll go back to lvm stripes on non-fakeraid0 for verification, without the md5sum goof-up.

The good news is: ZFS read performance really beats everything here, averaging ~355MB/s. (!)
This was a bit too awesome for my taste; I didn't quite trust that things weren't just cached/buffered somewhere. I tried to minimize that chance by exporting/importing the pool before performing the read tests. No difference, however... So unless someone points me to the obvious mistake...?

By comparison, linear lvm2 on the same fake-raid0 is meager: 'just' 264MB/s read speed :)

Also note that the asymmetry of read vs. write performance is extreme with ZFS (single vdev on fake-raid0): ~57 vs ~355 MB/s. Whoah: roughly a factor of 7.
The chasm is not nearly as extreme using ext2 on linear lvm2 on fake-raid0: ~151 vs ~264 MB/s, not even a factor of 2.

It seems to me things could be tuned on the ZFS write side. Note that these are new SSDs and the blocks used for these tests have never been written to before, so the classic 'MLC write degradation' arguments should be ruled out here.
I'll simply have to try the karmic beta with the 0.6.x versions in order to test the big_writes patch. Somehow, I expect big wins especially for this test scenario... Does anyone know whether I still have to upgrade the version of fuse in order to do so, or does karmic come with a modern-enough fuse module?
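As an aside, checking which fuse bits are installed is easy enough (standard commands, nothing zfs-fuse specific):

root@intrepid:~# fusermount -V                  # userspace fuse utilities/library version
root@intrepid:~# dpkg -l libfuse2 fuse-utils    # the Ubuntu packages involved
root@intrepid:~# uname -r                       # kernel, and thus the in-tree fuse module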

ZFS WRITE PERFORMANCE:

root@intrepid:/home/sehe/custom# ls *.iso | cpio -ov | pv > /ssd/uncompressed/fast
2.53GB 0:00:45 [57.5MB/s]

root@intrepid:/home/sehe/custom# ls *.iso | cpio -ov | pv > /ssd/compressed/fast
2.53GB 0:00:45 [56.4MB/s]

ZFS READ PERFORMANCE:

root@intrepid:~/SSDTEST# zpool export ssd
root@intrepid:~/SSDTEST# zpool import -d /dev/mapper/ ssd


root@intrepid:~/SSDTEST# echo 1 > /proc/sys/vm/drop_caches
root@intrepid:~/SSDTEST# pv /ssd/uncompressed/fast > /dev/null
2.53GB 0:00:07 [ 359MB/s]


root@intrepid:~/SSDTEST# echo 1 > /proc/sys/vm/drop_caches
root@intrepid:~/SSDTEST# pv /ssd/compressed/fast > /dev/null
2.53GB 0:00:07 [ 355MB/s]


LVM WRITE PERFORMANCE:

root@intrepid:/home/sehe/custom# ls *.iso | cpio -ov | pv > ~/SSDTEST/bios_striped/fast
2.53GB 0:00:17 [ 151MB/s]

LVM READ PERFORMANCE:

root@intrepid:~/SSDTEST# echo 1 > /proc/sys/vm/drop_caches
root@intrepid:~/SSDTEST# pv bios_striped/fast > /dev/null
2.53GB 0:00:09 [ 264MB/s]


NOTES ON SETUP:

[1] I changed the data policy from 'randomly generated blob' to 'assorted ISOs'. For convenience I pipe directly from cpio. The cpio overhead is negligible:

root@intrepid:/home/sehe/custom# pv *.iso > /ssd/uncompressed/fast
2.53GB 0:00:43 [  60MB/s]


I picked ISOs because, although not random, they should not normally compress easily, since the ISO format is compressed by default. Check (after copying the data, of course):

root@intrepid:~/SSDTEST# zfs list -o name,compression,compressratio
NAME              COMPRESS  RATIO
ssd                    off  1.00x
ssd/compressed          on  1.00x
ssd/uncompressed       off  1.00x

Full details of zpool/fs and lvm options:

# note that isw_bgfdegeihb_SSD is a raid0 set of my 2 new Corsair X64 SSDs; partitions 6 and 7 both amount to 20g of unformatted space
zpool create ssd /dev/mapper/isw_bgfdegeihb_SSD6
zfs create -o compression=on ssd/compressed
zfs create -o compression=off ssd/uncompressed

pvcreate /dev/mapper/isw_bgfdegeihb_SSD7
vgcreate ssd /dev/mapper/isw_bgfdegeihb_SSD7
lvcreate ssd -n bios_striped_and_lvm_linear -L20g
mkfs.ext2 -E stripe-width=32 -b 4096 -L bios_stripe /dev/ssd/bios_striped_and_lvm_linear
mkdir -p SSDTEST/bios_striped
mount -o noatime /dev/ssd/bios_striped_and_lvm_linear SSDTEST/bios_striped

Jonathan Schmidt

Oct 9, 2009, 11:50:35 AM10/9/09
to zfs-...@googlegroups.com
Some quick comments:

- ZFS shouldn't be faster than anything at full sequential reading. I
agree -- that's suspicious, but I can't see anything obviously wrong
with your setup. Well, maybe one thing. You should try to run tests
that take minutes or longer to complete. The difference between 6
seconds and 7 seconds is technically over 10%, but probably
insignificant. Increase your data size by 10x or 100x if possible.

- Have you tried any random reads/writes or mixed traffic?

- ISO is not a compressed format, but often the data inside an ISO is
compressed. Checking compressratio at 1.00x is enough to prove that the
data is incompressible, so that's fine.

- "fake-RAID" is usually not worth it, as you have shown below.

- I previously suspected that md5sum was limiting your performance,
but I didn't mention it. /dev/null is pretty fast though :D

Jonathan

sghe...@hotmail.com

Oct 9, 2009, 1:51:39 PM10/9/09
to zfs-...@googlegroups.com
Jonathan Schmidt wrote:
> Some quick comments:
>
> - ZFS shouldn't be faster than anything at full sequential reading. I
> agree -- that's suspicious, but I can't see anything obviously wrong
> with your setup. Well, maybe one thing. You should try to run tests
> that take minutes or longer to complete. The difference between 6
> seconds and 7 seconds is technically over 10%, but probably
> insignificant. Increase your data size by 10x or 100x if possible.
I'm hesitant to subject my precious fresh SSD blocks to such benchmarks
(especially since there is not yet a supported method of wiping the
cells, like the OCZ series already have). I cleverly laid out my SSDs
physically in 20GB chunks so that I could run independent 'fresh' tests,
and I'm already running out of room. Since 2.5GB runs in seconds,
nothing will ever run into minutes unless you deliberately keep
re-writing the same cells. With an SSD, that is just about the last
thing you want to do :)
>
> - Have you tried any random reads/writes or mixed traffic?
For the same reasons, no. I'm looking into maybe running Bonnie++ tests
in read-only mode later.
>
> - ISO is not a compressed format, but often the data inside an ISO is
> compressed. Checking compressratio at 1.00x is enough to prove that
> the data is incompressible, so that's fine.
Hmmm. I'll need to check that on Wikipedia. I was pretty sure it is
inherently compressed.
>
> - "fake-RAID" is usually not worth it, as you have shown below.
>
> - I was previously suspicious if md5sum was limiting your performance,
> but I didn't mention it.
I did! Apparently, though, I was wrong when I measured an md5 throughput
of ~260MB/s. I don't remember how I got that number, but it came out
pretty consistently at 184MB/s the last time I checked, which pretty
much invalidates those earlier read speeds.
Now an interesting thought to entertain: maybe md5sum's performance
characteristic is wildly dependent on the input values? In that case
there could be a security hole in there, as it might be possible to
infer certain information about the input just by measuring how much
time (relatively) is spent calculating its checksum :) In other
words: I don't suppose this could be the case.
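Measuring md5sum on its own, with the disk out of the picture, would settle it. E.g. feed it straight from tmpfs, or use openssl's built-in benchmark:

root@intrepid:/tmp# pv blob.bin | md5sum    # blob.bin is on tmpfs, so this is pure CPU
root@intrepid:/tmp# openssl speed md5       # synthetic MD5 throughput per block size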
> /dev/null is pretty fast though :D
Have you mounted /dev/null on tmpfs :) Hehehe

Jonathan Schmidt

Oct 9, 2009, 3:48:27 PM10/9/09
to zfs-...@googlegroups.com
>> - ZFS shouldn't be faster than anything at full sequential reading. I
>> agree -- that's suspicious, but I can't see anything obviously wrong
>> with your setup. Well, maybe one thing. You should try to run tests
>> that take minutes or longer to complete. The difference between 6
>> seconds and 7 seconds is technically over 10%, but probably
>> insignificant. Increase your data size by 10x or 100x if possible.
>
> I'm hesitant to put my precious fresh SSD blocks to such benchmarks
> (especially seeing there is not yet a support method of wiping the
> cells, like the OCZ series already have). I cleverly managed my SSDs
> physically in 20GB chunks so that I could run independent 'fresh' tests.
> I'm already running out of room. Since 2.5gb runs into the seconds,
> nothing will ever run into minutes, unless you deliberately intend to
> continuous re-write the same cells. With SSD, that is actually the last
> thing you want to do :)

You can run an ATA "Secure Erase" command to refresh the entire SSD back
to (like-)new conditions.

Partitioning the drive into 20GB chunks actually does nothing, due to
the way the internal flash indirection system works. Writing to the
same LBA 100x over, to 100 sequential LBAs, or to 100 random LBAs has
roughly the same effect internally.

>> - Have you tried any random reads/writes or mixed traffic?
> For the same reasons, no. I'm looking into maybe running Bonnie++ tests
> in read-only mode later.

See above. Sequential or random I/O will cause the same sort of wear on
the SSD. All this benchmarking you are doing won't significantly reduce
the overall lifespan of your drives, but I understand your hesitance
anyway. Note that writing wears out the drive much quicker than
reading, so perhaps focus your tests on read performance :)

>> - ISO is not a compressed format, but often the data inside an ISO is
>> compressed. Checking compressratio at 1.00x is enough to prove that
>> the data is incompressible, so that's fine.
> Hmmm. I'll need to check that on Wikipedia. I was pretty sure it is
> inherently compressed.

"They are stored in an uncompressed format."
- http://en.wikipedia.org/wiki/ISO_image

No harm done; your ISOs obviously *contained* compressed data, even if
the format itself isn't compressed.

>> - "fake-RAID" is usually not worth it, as you have shown below.
>>
>> - I was previously suspicious if md5sum was limiting your performance,
>> but I didn't mention it.
> I did ! Apparently though, I was wrong when I measured an md5 throughput
> of ~260 Mb. I don't remember how I got that number, but it turned out
> pretty consistently at 184Mb/s the last time I checked, which pretty
> much invalidates those earlier read speeds.
> Now an interesting thought to entertain: maybe md5sum's performance
> characteristic is wildly dependent on the input values? In that case,
> there could be a security whole in there, as it may be possible to
> induce certain information about the input just by measuring how much
> time (relatively) is spent calculating the checksum of it :) In other
> words: I don't suppose this could be the case.

Timing attacks are common in security systems. If you want a challenge,
try looking into differential power analysis methods of plaintext/key
recovery from smart cards, etc. It's crazy stuff!

>> /dev/null is pretty fast though :D
> Have you mounted /dev/null on tmpfs :) Hehehe

Yikes!

sghe...@hotmail.com

Nov 7, 2009, 5:49:10 PM11/7/09
to zfs-...@googlegroups.com
Jonathan Schmidt wrote:
> You can run an ATA "Secure Erase" command to refresh the entire SSD
> back to (like-)new conditions.
I've spent many hours over the last while trying to find a tool that
would actually let me do that. Burning ISOs, getting FreeDOS, installing
Windows, installing hdparm from a source snapshot, browsing many pages
of docs, forums and specs: it all seems to come down to the ATA
"password" being secret for OCZ X64 drives. I can't SECURE-ERASE my
disks to date. Do you have any concrete experience with that?

>
> Partitioning the drive into 20GB chunks actually does nothing, due to
> the way that the internal flash indirection system works. Writing to
> the same LBA 100x over, or 100 sequential LBAs, or 100 random LBAs
> have roughly the same effect internally.
Yeah I can see how that works now.

sghe...@hotmail.com

Nov 7, 2009, 5:55:39 PM11/7/09
to zfs-...@googlegroups.com
sghe...@hotmail.com wrote:
> [2] anyone ever tried running /home on ZFS and experienced a near
> standstill when running .wine programs? Well, the problem lies in the
> middle here (technical design compromises with wine) but it was sure
> painful to see just how pathological the performance of zfs-fuse
> became compared to just about any other linux FS.
It could be most interesting to revisit this little statistic since we
now have a more optimized version of zfs-fuse (see my latest benchmark
results at
http://groups.google.com/group/zfs-fuse/msg/ff469c1dd196cfc1?hl=en).
Besides, ZFS now supports a (then-new, for me at least) casesensitivity
property, which according to
http://wiki.winehq.org/CaseInsensitiveFilenames should change the
picture completely for Wine.
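Untested on my side, but since casesensitivity can only be set at dataset creation time, it would presumably be something along these lines for a Wine home (dataset name hypothetical):

$ zfs create -o casesensitivity=insensitive ssd/winehome
$ zfs get casesensitivity ssd/winehome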

Emmanuel Anne

Nov 8, 2009, 12:31:06 AM11/8/09
to zfs-...@googlegroups.com
Yeah, but only if you succeed in telling wine to forget about case sensitivity, which doesn't seem to be so easy.
Anyway, there were optimizations in wine for that too, so I don't think you'll run into a standstill again; I encountered this problem only with very specific programs (and even if it's not exactly a standstill, it's still horribly slow).
It's really a wine problem more than an fs performance problem, though.

Mike Hommey

Nov 8, 2009, 2:55:02 AM11/8/09
to zfs-...@googlegroups.com
On Sat, Nov 07, 2009 at 11:49:10PM +0100, sghe...@hotmail.com wrote:
>
> Jonathan Schmidt wrote:
> > You can run an ATA "Secure Erase" command to refresh the entire SSD
> > back to (like-)new conditions.
> I've spent many hours over the last period trying to find a tools that
> would actually let me do that. Burning ISOs, getting FreeDOS, installing
> windows, installing hdparm from source snapshot, browsing many pages of
> docu, forums and spec: all seems to be down the ATA "password" being
> secret for OCZ X64 drives. I can't SECURE-ERASE my disks to date. Do you
> have any concrete experience with that?

http://ata.wiki.kernel.org/index.php/ATA_Secure_Erase

You have both the hdparm way, which won't work if your BIOS freezes the
disks' security features, and a link to HDDErase, a DOS tool that can
deal with that (unfreezing will require a reboot, though, so you'll have
to boot into HDDErase twice).
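A quick way to see whether the BIOS has frozen the drive before picking either route:

# hdparm -I /dev/sdX | grep frozen    # prints either "frozen" or "not frozen"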

sghe...@hotmail.com

Nov 8, 2009, 6:05:13 AM11/8/09
to zfs-...@googlegroups.com
Emmanuel Anne wrote:
> Yeah but only if you succeed to tell wine to forget about case
> sensitivity which doesn't seem to be so easy.
I asked this precise question at
http://bugs.winehq.org/show_bug.cgi?id=3817 yesterday :), although I
expect to get an answer like: there is no option to tell wine. But
proving the existence of a specific filename on an underlying CI
filesystem will automatically be optimal (the first try will always
succeed). Proving the absence of such a file would still be slower than
optimal, since Wine will assume it needs to check every possible
spelling when in fact it need not.
> Anyway there were optimizations in wine for that too, so I don't think
> you'll run into a stand still again, I encountered this problem only
> for very specific programs (and even if it's not exactly a stand
> still, it's still horribly slow).
> It's really a wine problem more than a fs performance problem though.
Agreed

sghe...@hotmail.com

Nov 8, 2009, 6:12:44 AM11/8/09
to zfs-...@googlegroups.com
Mike Hommey wrote:
> http://ata.wiki.kernel.org/index.php/ATA_Secure_Erase
>
Great, hadn't found that page before. Quote:
>> To successfully issue an ATA Security Erase command you need to first
>> set a user password. This step is omitted from almost all other
>> sources which describe how to secure erase with hdparm.
This seems only too true! I'll have a go. Only this time I'll need to
back up my existing install :)

> You have both the hdparm way, that won't work if your BIOS freezes the
> disks security features,
mmm my BIOS has no options in that area
> and a link to HDDErase that is a DOS tool that
> can deal with that (unfreezing will require a reboot, though, so you'll
> have to boot on HDDErase twice)
>
I had tried HDDErase before, but I gave up on the assumption that I
should have found a pre-existing (factory?) password somewhere :) I
suppose hdparm will do the job now that I understand how the security
password is intended!

Thanks again, Mike
Seth

Mike Hommey

Nov 8, 2009, 6:25:03 AM11/8/09
to zfs-...@googlegroups.com
On Sun, Nov 08, 2009 at 12:12:44PM +0100, sghe...@hotmail.com wrote:
>
> Mike Hommey wrote:
> > http://ata.wiki.kernel.org/index.php/ATA_Secure_Erase
> >
> Great, hadn't found that page before. Qoute:
> >> To successfully issue an ATA Security Erase command you need to first
> >> set a user password. This step is omitted from almost all other
> >> sources which describe how to secure erase with hdparm.
> This seems only too true! I'll have a go. Only this time I'll need to
> backup my existing install :)
>
> > You have both the hdparm way, that won't work if your BIOS freezes the
> > disks security features,
> mmm my BIOS has no options in that area

Usually, they don't have an option to disable that, but HDDErase somehow
manages to make the BIOS not freeze once at next boot.

Jonathan Schmidt

Nov 9, 2009, 2:29:30 PM11/9/09
to zfs-...@googlegroups.com
>>> http://ata.wiki.kernel.org/index.php/ATA_Secure_Erase
>>>
>> Great, hadn't found that page before. Qoute:
>>>> To successfully issue an ATA Security Erase command you need to first
>>>> set a user password. This step is omitted from almost all other
>>>> sources which describe how to secure erase with hdparm.
>> This seems only too true! I'll have a go. Only this time I'll need to
>> backup my existing install :)
>>
>>> You have both the hdparm way, that won't work if your BIOS freezes the
>>> disks security features,
>> mmm my BIOS has no options in that area
>
> Usually, they don't have an option to disable that, but HDDErase somehow
> manages to make the BIOS not freeze once at next boot.

I usually just hot-reboot the drives once Linux is booted (use a Live-CD
to be safe). Then the BIOS can't get its mucky paws into the ATA
security features and hdparm works beautifully.

- boot system from Ubuntu live CD
- "hdparm -I /dev/sdX" will show "frozen"
- unplug SATA data cable from drive
- unplug power from drive
- plug power into drive
- plug SATA data cable back in
- wait a few seconds
- "hdparm -I /dev/sdX" will now show "not frozen"
- hdparm --user-master u --security-set-pass test /dev/sdX
- hdparm --user-master u --security-erase test /dev/sdX

BOOM!

To be safe, I usually unplug all disks from the system except the one I
intend to erase.
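For good measure, the drive's security section (including its own estimate of how
long the erase will take) can be checked along the way:

- "hdparm -I /dev/sdX | grep -A10 Security" shows supported/enabled/locked/frozen
  plus the "min for SECURITY ERASE UNIT" estimate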

HTH,

Jonathan

Mike Hommey

Nov 9, 2009, 3:20:08 PM11/9/09
to zfs-...@googlegroups.com
On Mon, Nov 09, 2009 at 11:29:30AM -0800, Jonathan Schmidt wrote:
> I usually just hot-reboot the drives once Linux is booted (use a Live-CD
> to be safe). Then the BIOS can't get its mucky paws into the ATA
> security features and hdparm works beautifully.
>
> - boot system from Ubuntu live CD
> - "hdparm -I /dev/sdX" will show "frozen"
> - unplug SATA data cable from drive
> - unplug power from drive
> - plug power into drive
> - plug SATA data cable back in
> - wait a few seconds
> - "hdparm -I /dev/sdX" will now show "not frozen"
> - hdparm --user-master u --security-set-pass test /dev/sdX
> - hdparm --user-master u --security-erase test /dev/sdX
>
> BOOM!

Yeah, I read that you could hot-replug disks to unfreeze them, except that
it's somehow not practical to do on a laptop...

Mike

sghe...@hotmail.com

Nov 9, 2009, 4:06:07 PM11/9/09
to zfs-...@googlegroups.com
Brilliant thread. Pity it's a bit off-topic for the list :) Thanks anyway for sharing your experiences. This will come in handy shortly for me.

Jonathan Schmidt

Nov 9, 2009, 4:37:48 PM11/9/09
to zfs-...@googlegroups.com

I don't see a problem with that. Unless your laptop was designed by
sadists (or Apple?), the HDD bay should be externally accessible and you
can hot plug the drive nearly as easily as a desktop. At most you
should just have to flip the laptop over briefly.

Mike Hommey

Nov 9, 2009, 4:47:49 PM11/9/09
to zfs-...@googlegroups.com
On Mon, Nov 09, 2009 at 01:37:48PM -0800, Jonathan Schmidt wrote:
> I don't see a problem with that. Unless your laptop was designed by
> sadists (or Apple?), the HDD bay should be externally accessible and you
> can hot plug the drive nearly as easily as a desktop. At most you
> should just have to flip the laptop over briefly.

You nailed it: to access the HDD, I need to remove the keyboard and
the palmrest. (and FWIW, some Apple models have very accessible HDDs)

Mike

sghe...@hotmail.com

Nov 10, 2009, 3:11:48 AM11/10/09
to zfs-...@googlegroups.com

> I don't see a problem with that. Unless your laptop was designed by
> sadists (or Apple?), the HDD bay should be externally accessible and you
> can hot plug the drive nearly as easily as a desktop. At most you
> should just have to flip the laptop over briefly.
>
If you care, I can post a video of me disassembling my Acer One,
replacing the hard disk and re-assembling it. Note that it was not even
possible to disconnect the hard disk without unscrewing the
daughterboard. Wiggling the disk loose might be possible without
removing the motherboard, but reconnecting it would be almost impossible
blind (the connector is on the bottom side of the motherboard).

My Dell laptop could actually be ok...
