I don't know the answer to the above, sorry.
> P.S. is it also true that it's better to use /dev/disk/by-id instead of
> /dev/sdX? I think this only applies when using raw disks,
> correct?
No, the /dev/disk/by-id directory contains symlinks to each partition of
the disks, along with the disks themselves. So /dev/sda might be
/dev/disk/by-id/xxxxx and /dev/sda1 would be /dev/disk/by-id/xxxxx-part1.
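The pattern is easy to see with `ls -l /dev/disk/by-id`. As a purely illustrative stand-in (the device name and serial below are invented, and on a real system the links are created by udev), here is the same layout mimicked in a scratch directory:

```shell
# Mimic the /dev/disk/by-id layout in a temp dir; real names encode the
# bus, model and serial number of the drive.
demo=$(mktemp -d)
ln -s ../../sda  "$demo/ata-EXAMPLE_DISK_SERIAL0001"        # whole disk
ln -s ../../sda1 "$demo/ata-EXAMPLE_DISK_SERIAL0001-part1"  # first partition
ls -l "$demo"
```

On a real box, pointing zpool at the by-id names gives you device paths that survive controller reordering across reboots.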
That is 100% correct. If you do not partition your drives before adding
them to a zpool, you will not be able to import or export your zpools
correctly to another OS, or if you have to rebuild your current OS.
- Eric
----------------------------------------
Eric J. Krieger
Ubuntu Member and Altruistic Network Ninja.
Email: grammat...@ubuntu.com
Wiki: https://wiki.ubuntu.com/GrammatonCleric
NM LoCo: https://wiki.ubuntu.com/NewMexicoTeam
Random Quote:
"No matter what the game, no matter
what the rules, the same rules apply
to both sides!"
-Hoyle's Law
Are you giving a suggestion? Are you asking for documentation to be
updated? Are you questioning the advice you are getting? Are you asking
for new features? Are you simply curious?
As long as you just keep repeating 'but why', I think it might be a long
time before anyone will find the time to answer your potential question.
It is pretty clear that the answer isn't obvious (or you'd have 6 copies
of it by now) so ... in case of curiosity: the source is all yours!
("Use the source, Luke!")
Now, enough of this - I'd probably dive into the code myself if I had
the time right now
Seth
Yep
> unicron wrote:
>>> That is 100% correct. If you do not partition your drives before adding
>>> them to a zpool you will not be able to import or export your zpools
>>> correctly to other os or if you have to rebuild your current os.
>>>
>>>
>> Okay, but why? Are raw zpool's in Solaris different?
>>
>>
> Unicron, I think you are coming across as more stubborn than you intend
> (:)) but right now I only feel the need to redirect the question 180
> degrees. Why does it matter to you?
It matters to me because I want to understand the system that I'm using.
> Are you giving a suggestion? Are you asking for documentation to be
> updated? Are you questioning the advice you are getting? Are you
> asking for new features? Are you simply curious?
Well, I'd really like to know, too. It was always my assumption that if
I used raw vdev's I'd not only be following ZFS best practices, but I'd
also be able to use the pool under FreeBSD or (Open)Solaris when I
eventually got some more powerful hardware for hosting my VMs. This
whole thing is coming as a bit of a shock, and making me wonder if I
should be heading down a totally different road (btrfs, dmraid, or
something else).
> As long as you just keep repeating 'but why' I think it might be a
> long time before anyone will find the time to answer your potential
> question. It is pretty clear that the answer isn't obvious (or you'd
> have 6 copies of it by now) so ... in case of curiosity: the source is
> all yours! ("Use the source, Luke!")
Well, c'mon now. If the answer _really_ isn't obvious it's pretty
unlikely that the source is going to reveal it easily. Someone who
understands this issue ought to at least be able to provide some brief,
initially-incomprehensible explanation that we can chew on until we get
it.
--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com
My guess is: there is one (lead) dev and he is currently busy (R.
Correia). He'll probably (*just guessing*) be answering along the lines
of: it was easier to implement vdev management against the kernel API
for partitions than to meddle with raw block devices (and their locking),
at least from FUSE. I might be wrong, but it seems like such a priority
call is very likely when you port a thing like this and
first-and-foremost want to have it running.
If we want to know... 2 options: wait for the answer from God in the
cloud (Ricardo, is that you?) or look at the source.
Nah I don't think the source will be confusing. I just think no-one
reads it. (Sorry devs: I'm not ignoring your efforts here! I'm simply
exaggerating the lack of resources in order to illustrate my theory). If
you read it, chances are it'll readily jump out (perhaps some form of
comment inline).
If this wasn't incomprehensible, then may I apologize by saying it wasn't
brief either LOL
Cheers
Seth
PS. Did I mention little of the above is based on facts, conversation,
prior or acquired knowledge? I am *just guessing*
>> Well, c'mon now. If the answer _really_ isn't obvious it's pretty
>> unlikely that the source is going to reveal it easily. Someone who
>> understands this issue ought to at least be able to provide some brief,
>> initially-incomprehensible explanation that we can chew on until we get
>> it.
>>
> I hear you. I hate it too :)
I guess I'll probably try serving ZFS from a FreeBSD VM, then.
> My guess is: there is one (lead) dev and he is currently busy (R.
> Correia). He'll probably (*just guessing*) be answering along the lines
> of: it was easier to implement vdev management against the kernel API
> for partitions than meddling with raw block devices (and their locking),
> at least from fuse. I might be wrong, but it seems like such a priority
> call is very likely to happen when you wish to port a thing like this
> and first-and-foremost want to have it running?
>
> If we want to know... 2 options: wait for the answer from God in the
> cloud (Ricardo, is that you?) or look at the source.
> Nah I don't think the source will be confusing. I just think no-one
> reads it. (Sorry devs: I'm not ignoring your efforts here! I'm simply
> exaggerating the lack of resources in order to illustrate my theory). If
> you read it, chances are it'll readily jump out (perhaps some form of
> comment inline).
Yeah, but sorry, I don't know the first thing about kernel APIs for
partitions /or/ meddling with raw block devices. I did take a crawl
through the source but came up empty.
> If this wasn't incomprehensible, then may I apologize by saying it wasn't
> brief either LOL
Yeah, you failed on both counts ;-)
Changelog:
hg log
Or see CHANGES file
(http://www.wizy.org/mercurial/zfs-fuse/trunk/file/008c531499cd/CHANGES)
The latest changes are recommended, and they are in the latest RPM
available from the Fedora repository. I am maintaining this package
and would like as much feedback on it as possible, hence the shameless
plug :)
--
With kind regards,
Uwe Kubosch
Kubosch Consulting
Norway
Since you're using Ubuntu, why not use Filip Brcic's ppa?
https://wiki.ubuntu.com/ZFS/
http://ppa.launchpad.net/brcha/ubuntu/
I believe it uses changeset 375 from trunk
Sorry for taking so long to answer, I've been a bit busy as you may have
guessed.
On Qui, 2009-01-08 at 11:52 -0800, unicron wrote:
> sorry for all the why's, but I'm basically curious and I like to
> understand the filesystem I use (of course, to a certain limit). I
> understand that it is a question that few people can answer and I am
> very happy to wait for a while.
So here's what happens:
When you create a ZFS pool on a whole disk in Solaris (or you add a new,
whole disk to an existing pool), the zpool command will use libdiskmgt
to create an EFI/GPT disk label on the disk with a single large
partition (and also a very small one, but this is irrelevant), and then
it will use this partition to store the data.
However, in Linux, this doesn't happen because there is no libdiskmgt
and it would be hard to port it, so zfs-fuse will simply treat a raw
disk like any other block device (i.e. it will just write the data to
it, it won't create a GPT label like in Solaris).
The problem with the latter is that Solaris doesn't like disks without
labels/partitions, so if you create your pool on a raw disk with
zfs-fuse, you won't be able to import it on Solaris.
I have no idea how FreeBSD handles pools on raw disks, though (but my
guess is that it would work).
My recommendation is to always create an msdos label on your disk,
because there are some EFI/GPT incompatibilities between Solaris and
Linux. You can do this with the 'fdisk' program.
BTW, I've worked on a patch to make zpool create EFI labels like in
Solaris, but with libparted instead of libdiskmgt.
If anyone is interested, it's here:
https://bugzilla.lustre.org/show_bug.cgi?id=14548
Unfortunately, a zpool command with this patch can't currently be
distributed because the libparted license (GPLv2) is incompatible with
the zpool license (CDDL).
> P.S. I'm currently using the 0.5.0 version from the ZFS-FUSE blog
> webpage. I've seen some technical discussions in this google group
> about changes that have been made since 0.5.0 was released. Where can I find
> the latest repository? And is it recommended to use it (as in, does it fix
> critical things when using raidz)?
You can find a pointer to the Mercurial repository in the main web page:
http://www.wizy.org/wiki/ZFS_on_FUSE
There have been some fixes, one of which, related to SCSI devices, might
be important to avoid corruption.
You can find the changelog in the repository by going to the link below,
and then clicking on 'files' and then 'CHANGES':
http://www.wizy.org/mercurial/zfs-fuse/trunk/
HTH,
Ricardo
> Hi,
>
> Sorry for taking so long to answer, I've been a bit busy as you may have
> guessed.
Thanks so much for following up.
> On Qui, 2009-01-08 at 11:52 -0800, unicron wrote:
>> sorry for all the why's, but I'm basically curious and I like to
>> understand the filesystem I use (of course, to a certain limit). I
>> understand that it is a question that few people can answer and I am
>> very happy to wait for a while.
>
> So here's what happens:
>
> When you create a ZFS pool on a whole disk in Solaris (or you add a new,
> whole disk to an existing pool), the zpool command will use libdiskmgt
> to create an EFI/GPT disk label on the disk with a single large
> partition (and also a very small one, but this is irrelevant), and then
> it will use this partition to store the data.
So, IIUC, you are saying that even though the official ZFS
recommendation is to use whole disks, when you do that, Solaris actually
partitions the disk under the covers anyway?
If so, why the emphasis on using whole disks? If not... I'm
really confused!
> However, in Linux, this doesn't happen because there is no libdiskmgt
> and it would be hard to port it, so zfs-fuse will simply treat a raw
> disk like any other block device (i.e. it will just write the data to
> it, it won't create a GPT label like in Solaris).
>
> The problem with the latter, is that Solaris doesn't like disks without
> labels/partitions, so if you create your pool on a raw disk with
> zfs-fuse, you won't be able to import it on Solaris.
>
> I have no idea how FreeBSD handles pools on raw disks, though (but my
> guess is that it would work).
You mean, you guess it would work without the label?
> My recommendation is to always create an msdos label on your disk,
> because there are some EFI/GPT incompatibilities between Solaris and
> Linux. You can do this with the 'fdisk' program.
So... Solaris can read msdos-labelled disks?
Yes.
> If so, why the emphasis on using whole disks? If not... I'm
> really confused!
Because when you use a whole disk in Solaris, ZFS will automatically
enable the disk write cache, because it knows that there is no other
filesystem there that could become damaged if the power fails (such as
the UFS filesystem).
So performance will be better and your disks will last longer.
> > I have no idea how FreeBSD handles pools on raw disks, though (but my
> > guess is that it would work).
>
> You mean, you guess it would work without the label?
Yes, but it's only a guess.
> > My recommendation is to always create an msdos label on your disk,
> > because there are some EFI/GPT incompatibilities between Solaris and
> > Linux. You can do this with the 'fdisk' program.
>
> So... Solaris can read msdos-labelled disks?
Yes.
Cheers,
Ricardo
Ricardo M. Correia wrote:
> On Sex, 2009-01-09 at 14:13 -0500, David Abrahams wrote:
>
>>> When you create a ZFS pool on a whole disk in Solaris (or you add a new,
>>>
And I love the way your (Portuguese?) mailer wants to write 'On Sex' all
the time :) I suppose that is only funny on my side of the globe...
Regards,
Seth
> On Sex, 2009-01-09 at 14:13 -0500, David Abrahams wrote:
>> > When you create a ZFS pool on a whole disk in Solaris (or you add a new,
>> > whole disk to an existing pool), the zpool command will use libdiskmgt
>> > to create an EFI/GPT disk label on the disk with a single large
>> > partition (and also a very small one, but this is irrelevant), and then
>> > it will use this partition to store the data.
>>
>> So, IIUC, you are saying that even though the official ZFS
>> recommendation is to use whole disks, when you do that, Solaris actually
>> partitions the disk under the covers anyway?
>
> Yes.
>
>> If so, why the emphasis on using whole disks? If not... I'm
>> really confused!
>
> Because when you use a whole disk in Solaris, ZFS will automatically
> enable the disk write cache, because it knows that there is no other
> filesystem there that could become damaged if the power fails (such as
> the UFS filesystem).
>
> So performance will be better and your disks will last longer.
Okay, this is good news; I feel better about the partitioning
requirement now. To me it sounds like zfs-fuse should perhaps refuse to
use whole disks in the Linux way, or at least give a loud warning.
Two tangential questions:
* Do I get control over the disk write cache in Linux?
* If I have battery backup for the whole server, should I have to worry
about enabling the disk write cache even if I share disks between zfs
and some other FS?
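On the first question: yes, Linux does expose the disk write cache, typically through hdparm for ATA disks (sdparm for SCSI). A hedged sketch, shown as comments only because it needs root and a real device (the /dev/sda here is hypothetical):

```shell
# Query the current write-cache setting of an ATA disk:
#   hdparm -W /dev/sda
# Enable or disable it:
#   hdparm -W1 /dev/sda   # enable (reasonable if ZFS owns the whole disk)
#   hdparm -W0 /dev/sda   # disable (the safer choice when sharing the disk)
```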
>> >> I have no idea how FreeBSD handles pools on raw disks, though (but my
>> > guess is that it would work).
>>
>> You mean, you guess it would work without the label?
>
> Yes, but it's only a guess.
Roger.
>> > My recommendation is to always create an msdos label on your disk,
>> > because there are some EFI/GPT incompatibilities between Solaris and
>> > Linux. You can do this with the 'fdisk' program.
>>
>> So... Solaris can read msdos-labelled disks?
>
> Yes.
Thanks, that's a big help.
> * If I have battery backup for the whole server, should I have to worry
>  about enabling the disk write cache even if I share disks between zfs
>  and some other FS?
Just my 20 cents
Seth
Is your point that the atomic updates will no longer be atomic if
there's another filesystem?
The point is that enabling write-caching affects *all* disk accesses,
including any other mounted filesystems (or even raw block access for
that matter). Other FS-es might rely on write ordering and stuff like
that in order to stay consistent. They won't be happy if their
'sibling' filesystem chose to enable dangerous features on the disk they
are sharing.
Think: good filesystem neighborship
> Dave wrote:
>> Is your point that the atomic updates will no longer be atomic if
>> there's another filesystem
>
> Not at all.
>
> The point is that enabling write-caching affects *all* disk accesses, so
> including any other mounted filesystems (or even raw block access for
> that matter).
Sure.
> Other FS-es might rely on write ordering and stuff like that in order
> to stay consistent.
Do you mean to tell me that write caching itself is, de rigueur, not
compatible with the requirements of ordinary filesystems like XFS?
> They won't be happy if their 'sibling' filesystem chose to enable
> dangerous features on the disk they are sharing.
>
> Think: good filesystem neighborship
AFAICT write caching with XFS is OK as long as write barriers are enabled:
http://tinyurl.com/xfsfaq
Yeah, but as I said, I have battery backup on the whole system.
Shouldn't that be enough to eliminate the power failure worries?
That does not cover crashes. Same impact, system goes down without
finishing disk activity.
--
Blessed are the young for they shall inherit the national debt.
Do you actually trust your battery backup? I've only had moderate success
with them. They are typically poorly maintained and infrequently (or
never) tested. I still use one of course, I just don't trust it :)
I don't know whether I trust it, which is even worse ;-)
> For all practical purposes you should be fine. What does the system do
> anyway? Is it a very important server, a home desktop or something in
> between?
I hope to make it very important. I hope to keep all the really
important stuff in ZFS, though, so if a crash or power outage corrupts
the XFS, I'm not too worried -- I'll have a backup of it in the ZFS FS
anyway.
> Even if you don't consider OS crashes, the filesystem could have a bug.
> You're never going to get a 100% guarantee of data integrity. I think the
> checksumming in ZFS is more important than the atomic writes, as far as I
> know the type of corruption that can be caught and fixed by ZFS checksums
> happens more frequently than crashes.
Interesting.
Thanks,
By the way: checksumming without meta-data integrity ('atomic writes')
yields nothing but a neatly checksummed pile of rubble, if you ask me.
You can't have the one without the other.
Also: 100% guarantee is not what we are after. We have off-site backups
to save us if the O-bomb hits our house. Presumably, the off-site backup
is properly checksummed and trustworthy :)
With many mainstream filesystems (at least in some common performance
modes, like ext3 'ordered-writes' if I remember correctly), enabling
write-caching will prevent the fs driver from doing a reliable (journal)
recovery in the face of a simple power failure [2] (without considering
'complications' like hardware damage). This would be what most people
'require' for everyday file storage.
Seth
[1] Once entropy reaches infinity, data can be proven to be impossible :)
[2] I would go out on a limb and conjecture that simple power
failures/hard shut-offs happen a lot more than crashes and 'silent data
corruption' (aka bit-rot) combined :)
PS: to the person who expressed skepticism about the use of backup power:
don't forget this should only be enough power to complete the writing of
unflushed buffers/caches in the controller/kernel pipelines. That would
normally take seconds, not minutes. An extreme case would be the delayed
writing of e.g. a 4 GB USB stick with more than 4 GB buffered. You can
see that take on the order of 30 s once you issue 'sync' or 'umount',
mostly because of the inferior speed of flash storage.
You are implicitly trusting that the computer will properly receive
notification that the UPS has lost mains power and shutdown is imminent.
It's just as bad to have a UPS that runs your system for 25 minutes after
a power loss, but when it finally dies the computer loses power without
even knowing something was wrong.
It's ironic that my city just had a momentary power hiccup, and that one
of my three UPS battery backups did not survive, and the computer attached
to it was powered off immediately. 66% success rate in a single trial
isn't good. Now I have to go replace that UPS -- it won't even turn on
anymore :(
Also, if you can find me a 4GB USB stick that can write 4G of data in 30
seconds, I'll buy it off of you. For starters, it would need multiple
USB2.0 links (or USB3?).
My point was simply: How often do you test your power loss contingency
setup? How confident are you that your PC will do the right thing
(unattended) if mains power is lost?
> Newer Linux filesystems use barriers to make write caching "safer" for
> metadata.
*Only* if those underlying devices support it *and* as long as you're not
using the device-mapper layer.
http://hightechsorcery.com/2008/06/linux-write-barriers-write-caching-lvm-and-filesystems
cheers,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
This email may come with a PGP signature as a file. Do not panic.
For more info see: http://en.wikipedia.org/wiki/OpenPGP
On Sat, 10 Jan 2009 7:44:38 am Joseph Mulloy wrote:
> Newer Linux filesystems use barriers to make write caching "safer" for
> metadata.
I *did* find, however, that any decent (non-too-cheapo) UPS will have a
test function, so you can at least test the triggering (software) setup.
I decided at that moment in time that UPS was overkill/unreliable for my
home use. I went with rigorous on- and off-site backups. It's easier to
do, easier to monitor (you can see that it *still* works on a day-to-day
basis), it puts me in control (I can keep things as simple as I wish,
the only complexity being the crypto keys that are in the safe in...
PRINT!). As long as I keep a working knowledge of current hardware and
OS-es I'll be able to simply restore at the cost of just bandwidth.
I could still be screwed of course, if a global war takes out both my
home and the internet. But hey, I imagine I have other things to worry
about then.
In my opinion, UPS-es are never meant to sustain power across grid
failures. Instead, the small-business/consumer UPS-es seem to focus on
notification and clean shutdown (system preservation). Only bigtime HA
providers, medical or military facilities will usually have proper
emergency generators. In fact I think that most of these institutions
have learned from past experience to do the same:
1. notify
2. power down
3. power up generators
4. controlled one-by-one restart of vital systems.
Cheers,
Seth
You have me at a loss when you intend to move to ZFS for day-to-day and
then Reiser (?????!) as a backup?!
That should be wholly reversed for a number of reasons.
Reiser is that thing that performs well and is flexible, but has a proven
history of being very easy to mess up, especially if you need to recover
anything. In general the on-disk format is so 'generic' that many things
could be mistaken for a Reiser filesystem image. (Never try to recover/fix
a Reiser fs on a disk that contains, e.g., a tarred Reiser image as a
file...). Just google around. Ext3 has a much better recoverability track
record.
ZFS (at least zfs-fuse) has a lower performance, a big CPU footprint on
read/write, but excellent (triple-A) data and meta-data consistency.
I use XFS for day-to-day use, local backups on ZFS (because of the data
checksumming).
> On Sat, 10 Jan 2009 7:44:38 am Joseph Mulloy wrote:
>
>> Newer Linux filesystems use barriers to make write caching "safer" for
>> metadata.
>
> *Only* if those underlying devices support it *and* as long as you're not
> using the device-mapper layer.
Nooooooooooooooo! My beloved LVM makes things less safe?
> http://hightechsorcery.com/2008/06/linux-write-barriers-write-caching-lvm-and-
> filesystems
/me gnashes teeth
Looks like EVMS is no better in this regard, being built on
device-mapper MD. :(
kernel: [ 16.393477] SGI XFS with ACLs, security attributes, realtime,
large block numbers, no debug enabled
kernel: [ 16.393933] SGI XFS Quota Management subsystem
kernel: [ 16.400657] Filesystem "dm-1": Disabling barriers, trial
barrier write failed
kernel: [ 16.407272] XFS mounting filesystem dm-1
> Sorry forgot to mention the device-mapper thing. I use LVM and md-raid on my
> home server so the write barriers don't do me any good.
Hmm, makes me reconsider again how much of my space to devote to ZFS.
Since I am going to be (among other things) running a virtual build/test
farm, it seems to me that I want the fastest possible storage for
intermediate files, etc., integrity be damned (repositories are hosted
elsewhere). I probably want to reserve the use of ZFS for those things
that I can't afford to lose.
> Another advantage for ZFS-Fuse, you get RAID, Volume Management and
> atomic writes. Is anyone here using it for important data? I suppose I
> could try it and have a nightly tar backup to a reiserfs filesystem
> just in case. Do the NFS and CIFS sharing through the ZFS commands
> work or am I better off just configuring in the standard config files?
IIUC you need to go through the regular Linux interfaces for NFS.
Worked for me when I did it. You also need to have a version of Fuse
that's configured to allow FUSE filesystems to be shared. I don't know
the status of CIFS but I'd guess it's the same.
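To make that concrete, here's a hedged sketch of the pieces usually involved in NFS-exporting a FUSE mount through the kernel NFS server (the paths, network range and fsid value are illustrative conventions, not something confirmed in this thread):

```shell
# 1. The FUSE mount must be visible to other users (mount option
#    allow_other; for non-root mounts /etc/fuse.conf must also contain
#    user_allow_other).
# 2. FUSE filesystems have no stable device number, so the export needs an
#    explicit fsid. Hypothetical /etc/exports line:
#      /pool/media  192.168.1.0/24(rw,sync,fsid=1234)
# 3. Reload the export table:
#      exportfs -ra
```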
From what I understand, Xen has architectural advantages that mean KVM
can't really approach it. But that said, KVM is a lot easier to work with.
I wanted to run Xen on the server, but ran into the same thing because I
need to virtualize Windows there and you can't do that without
virtualization instructions. So I'm sticking with VMWare Server 2. You
can get a free NexentaStor VM that will manage up to 6TB of ZFS for you
if you want to play with ZFS that way.
> I'm starting to lean towards having a two server setup.
Yeah, these days it's really hard not to want the virtualization
instructions. One day I'll probably spin off the VMs to a different
machine and dedicate this server to fileserving, so I can run Solaris or
FreeBSD there.
The way I see it, if you need to use LVM (or anything based on device
mapper) and want to keep it safe, you should disable the disk's write
cache.
On zfs-fuse however, things are somewhat complicated because:
- when using LVM, disk write cache needs to be disabled (thus reducing
performance)
- bypassing LVM/MD (i.e. using a whole disk or partition) means you can
turn on write cache, but that means you need zfs raidz (unless your
hardware already performs raid).
- zfs-fuse raidz is much slower (about half the performance of zfs +
linux raid5)
Those three factors combined, the feasible choices are:
- zfs on top of LVM/MD with disk write cache disabled, if you trust MD raid
- zfs raidz on real disks, if you can live with much lower performance
Regards,
Fajar
> I use XFS for day-to-day use, local backups on ZFS (because of the data
> checksumming).
That's pretty much what I'm doing now, except I've swapped from XFS to ext4.
> On Sun, Jan 11, 2009 at 8:04 AM, David Abrahams <da...@boostpro.com> wrote:
>>
>>
>> on Sat Jan 10 2009, Chris Samuel <chris-AT-csamuel.org> wrote:
>>
>>> On Sat, 10 Jan 2009 7:44:38 am Joseph Mulloy wrote:
>>>
>>>> Newer Linux filesystems use barriers to make write caching "safer" for
>>>> metadata.
>>>
>>> *Only* if those underlying devices support it *and* as long as you're not
>>> using the device-mapper layer.
>>
>> Nooooooooooooooo! My beloved LVM makes things less safe?
>>
>>> http://hightechsorcery.com/2008/06/linux-write-barriers-write-caching-lvm-and-filesystems
>>
>> /me gnashes teeth
>>
>
> The way I see it, if you need to use LVM (or anything based on device
> mapper) and want to keep it safe, you should disable the disk's write
> cache.
OK, but I'm going to let it be unsafe. I need some fast scratch area;
I'll back up to ZFS for the stuff that counts.
> On zfs-fuse however, things are somewhat complicated because :
> - when using LVM, disk write cache needs to be disabled (thus reducing
> performance)
> - bypassing LVM/MD (i.e. using whole disk or partition) means you can
> turn on write cache, but that means you need zfs raidz (unless your
> hardware already performs raid).
I think copies=2 or copies=3 would work too.
> - zfs-fuse raidz is much slower (about half the performance) of zfs +
> linux raid5
So what does "zfs+linux raid5" really mean?
> Those three factors combined, the feasible choices are :
> - zfs on top of LVM/MD with disk write cache disabled, if you trust MD raid
> - zfs raidz on real disks, if you can live with much lower performance
How about:
* all disks sliced into a couple of partitions
* dmraid (or mdraid? I can't keep those two straight) striping across
all 1st partitions for unreliable but really fast storage
* zpool across all 2nd partitions.
* ZFS filesystems with copies=2 and copies=3
?
What I'm most unsure about with this scheme is how to lay out partitions
and/or LVM volumes to maintain the most flexibility to grow either the
fast part or the reliable part as I use the system.
copies=n protects against "bad sectors" (or something like that). It does
not protect against a broken disk, e.g. if you use copies=2 without zfs
mirror/raidz/raidz2 on JBOD, and one of the disks is broken, you can say
"bye bye data ..."
>> - zfs-fuse raidz is much slower (about half the performance) of zfs +
>> linux raid5
>>
>
> So what does "zfs+linux raid5" really mean?
>
>
zfs on top of md raid5, or zfs on top of lvm on top of md.
>> Those three factors combined, the feasible choices are :
>> - zfs on top of LVM/MD with disk write cache disabled, if you trust MD raid
>> - zfs raidz on real disks, if you can live with much lower performance
>>
>
> How about:
>
> * all disks sliced into a couple of partitions
>
> * dmraid (or mdraid? I can't keep those two straight) striping across
> all 1st partitions for unreliable but really fast storage
>
> * zpool across all 2nd partitions.
>
> * ZFS filesystems with copies=2 and copies=3
>
> ?
>
> What I'm most unsure about with this scheme is how to lay out partitions
> and/or LVM volumes to maintain the most flexibility to grow either the
> fast part or the reliable part as I use the system.
>
>
If you can live with (currently) slow zfs-fuse raidz and have lots of
disks, it's easier to simply have dedicated disks for zfs, create a
partition on each disks that occupy all available space, and create
zpool with raidz/raidz2 on those partitions.
Regards,
Fajar
Not necessarily. First of all I don't think ZFS even has a "JBOD" mode.
If you add your disks as top level vdevs the default behaviour is
striping (similar to RAID0).
Secondly the copies=2/3 chunk allocator uses a best effort strategy to
locate the data and ditto blocks on different devices. There are
limitations to using copies=2/3 for pool redundancy but in general it
does work if you remove or crash a drive. Hopefully I'll get some free
time at some point and I can look into enhancing this (already great)
feature.
> copies=n protects agains "bad sector" (or something like that). It does
> not protect against broken disk. e.g if you use copies=2 without zfs
> mirror/raidz/raidz2 on JBOD, and one of the disks is broken, you can say
> "bye bye data ..."
I don't think so; see Jonathan Schmidt's response.
>>> - zfs-fuse raidz is much slower (about half the performance) of zfs +
>>> linux raid5
>>>
>>
>> So what does "zfs+linux raid5" really mean?
>>
>>
>
> zfs on top of md raid5, or zfs on top of lvm on top of md.
OK, but that still has a raid-5 write hole, right?
>>> Those three factors combined, the feasible choices are :
>>> - zfs on top of LVM/MD with disk write cache disabled, if you trust MD raid
>>> - zfs raidz on real disks, if you can live with much lower performance
>>>
>>
>> How about:
>>
>> * all disks sliced into a couple of partitions
>>
>> * dmraid (or mdraid? I can't keep those two straight) striping across
>> all 1st partitions for unreliable but really fast storage
>>
>> * zpool across all 2nd partitions.
>>
>> * ZFS filesystems with copies=2 and copies=3
>>
>> ?
>>
>> What I'm most unsure about with this scheme is how to lay out partitions
>> and/or LVM volumes to maintain the most flexibility to grow either the
>> fast part or the reliable part as I use the system.
>>
>
> If you can live with (currently) slow zfs-fuse raidz and have lots of
> disks,
I have 8. Don't know if that's "lots."
> it's easier to simply have dedicated disks for zfs, create a partition
> on each disk that occupies all available space, and create a zpool with
> raidz/raidz2 on those partitions.
Yes, it's easier. However, I'm not sure how my storage needs will
evolve, so it's not necessarily more workable.
Have you tried this? Is there a test case for this?
My tests so far indicate that if :
- I have zpool on multiple vdevs (using files, for test purposes),
striped (not mirror, not raidz), created with
# dd if=/dev/zero of=/tmp/disk1 bs=1M seek=100 count=0
0+0 records in
0+0 records out
0 bytes (0 B) copied, 4.7763e-05 s, 0.0 kB/s
# dd if=/dev/zero of=/tmp/disk2 bs=1M seek=100 count=0
0+0 records in
0+0 records out
0 bytes (0 B) copied, 4.8191e-05 s, 0.0 kB/s
# zpool create test /tmp/disk1 /tmp/disk2
- set copies=2 (or more)
- put some test files on it
- export zpool
- remove one of the vdevs (in this case /tmp/disk2)
- try to import the pool
It will fail
# zpool import -d /tmp
pool: test
id: 17946218857558819078
state: UNAVAIL
status: One or more devices are missing from the system.
action: The pool cannot be imported. Attach the missing
devices and try again.
see: http://www.sun.com/msg/ZFS-8000-6X
config:
test UNAVAIL missing device
/tmp/disk1 ONLINE
Additional devices are known to be part of this pool, though their
exact configuration cannot be determined.
So I'm not sure what you mean by "but in general it does work if you
remove or crash a drive", unless I'm doing something wrong with the
test.
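For reference, the test steps above collected into one script (same illustrative paths; the second dd is assumed to use the same bs/seek/count flags as the first):

```shell
# Two 100 MB sparse backing files for a striped (non-redundant) pool.
dd if=/dev/zero of=/tmp/disk1 bs=1M seek=100 count=0
dd if=/dev/zero of=/tmp/disk2 bs=1M seek=100 count=0
zpool create test /tmp/disk1 /tmp/disk2
zfs set copies=2 test
cp -r /etc/skel /test/            # any test data will do
zpool export test
rm /tmp/disk2                     # simulate a lost vdev
zpool import -d /tmp              # expected: "UNAVAIL missing device"
```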
Yes, I have tested that configuration and I did not encounter the error
you did ("Additional devices are known to be part of this pool, though
their exact configuration cannot be determined"). I got something more
like this:
test DEGRADED (I think?)
/tmp/disk1 ONLINE
/tmp/disk2 MISSING
It wasn't exactly that (but something close) and unfortunately I can't
test it out at the moment. However, the pool was importable, mountable,
and the files were intact. I'm curious though, are your temp "disk" files
0 bytes long? Mine were 100MB.
> Yes, I have tested that configuration and I did not encounter the error
> you did ("Additional devices are known to be part of this pool, though
> their exact configuration cannot be determined"). I got something more
> like this:
>
> test DEGRADED (I think?)
> /tmp/disk1 ONLINE
> /tmp/disk2 MISSING
>
Are you sure it was a stripe setup? It seems like a message for a
mirror/raidz setup
> It wasn't exactly that (but something close) and unfortunately I can't
> test it out at the moment. However, the pool was importable, mountable,
> and the files were intact. I'm curious though, are your temp "disk" files
> 0 bytes long? Mine were 100MB.
>
It was 0 before pool creation because it was created as a sparse file
(with the seek=100 and count=0). Now it occupies about 1.3 MB.
I'm positive it was striped (that was the whole reason for the test).
"DEGRADED" might be the wrong status, hence the uncertainty on that line.
The disk2 did list as missing though. Your pool seems to have forgotten
about it entirely.
I ran this test a long time ago (back when 0.4.0 was new) so maybe the
behaviour has changed since then? I'll test it again when I get the
chance.
>> It wasn't exactly that (but something close) and unfortunately I can't
>> test it out at the moment. However, the pool was importable, mountable,
>> and the files were intact. I'm curious though, are your temp "disk"
>> files
>> 0 bytes long? Mine were 100MB.
>>
>
> It was 0 before pool creation because it was created as a sparse file
> (with the seek=100 and count=0). Now it occupies about 1.3 MB.
Right.
>>> Yes, I have tested that configuration and I did not encounter the error
>>> you did ("Additional devices are known to be part of this pool, though
>>> their exact configuration cannot be determined"). I got something more
>>> like this:
>>>
>>> test DEGRADED (I think?)
>>> /tmp/disk1 ONLINE
>>> /tmp/disk2 MISSING
>>>
>>
>> Are you sure it was a stripe setup? It seems like a message for a
>> mirror/raidz setup
>
> I'm positive it was striped (that was the whole reason for the test).
> "DEGRADED" might be the wrong status, hence the uncertainty on that line.
> The disk2 did list as missing though. Your pool seems to have forgotten
> about it entirely.
>
> I ran this test a long time ago (back when 0.4.0 was new) so maybe the
> behaviour has changed since then? I'll test it again when I get the
> chance.
Once again, patiently waiting for new results before deciding how to
proceed :-)
Thanks all,
Here's what I'd do :
(1) Use hardware raid or Linux MD to provide redundancy
Yes, it's possible to have raid-5 write hole. Having a good storage
controller with battery-backed cache to handle raid helps prevent
this.
(2) Depending on your setup, you may need to turn OFF write caching on
each disk to ensure data safety.
(3) Setup LVM on top of MD
(4) Allocate PVs for zfs-fuse or ext3 as necessary, leaving room to
grow. For example, from 1TB VG, allocate 100GB for zfs-fuse and 100GB
for ext3. Leave 800GB untouched for now.
(5) Create the zpool using only one LV as a vdev. After that, either :
- turn off checksums, OR
- leave zfs checksums on and set copies=2 for important data
(6) If you need more space for zfs, create another LV, and add that LV
as another vdev
(7) If you need more space for ext3, grow the LV and use resize2fs
This setup should give you most flexibility, and a balance between
performance and data safety.
Regards,
Fajar
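Steps (1)-(7) might look roughly like this as commands. All device names, sizes, and the pool/VG names are illustrative assumptions:

```shell
# (1) md RAID5 provides the redundancy
mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sd[b-e]
# (3) LVM on top of md
pvcreate /dev/md0
vgcreate vg0 /dev/md0
# (4) allocate LVs, leaving most of the VG free to grow either side
lvcreate -L 100G -n zfslv vg0
lvcreate -L 100G -n ext3lv vg0
mkfs.ext3 /dev/vg0/ext3lv
# (5) zpool on a single LV; either disable checksums, or keep them
#     and set copies=2 for important data
zpool create tank /dev/vg0/zfslv
zfs set copies=2 tank
# (6) grow ZFS later by adding another LV as a second vdev
lvcreate -L 200G -n zfslv2 vg0
zpool add tank /dev/vg0/zfslv2
# (7) grow ext3 later with lvextend + resize2fs
lvextend -L +100G /dev/vg0/ext3lv
resize2fs /dev/vg0/ext3lv
```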
> On Sat, Jan 17, 2009 at 6:35 AM, David Abrahams <da...@boostpro.com> wrote:
>> Once again, patiently waiting for new results before deciding how to
>> proceed :-)
>
> Here's what I'd do :
> (1) Use hardware raid or Linux MD to provide redundancy
> Yes, it's possible to have raid-5 write hole. Having a good storage
> controller with battery-backed cache to handle raid helps prevent
> this.
Why accept a raid-5 write hole for my critical data if I don't have to?
> (2) Depending on your setup, you may need to turn OFF write caching on
> each disk to ensure data safety.
Yeah, not necessary if I do what I'm planning, since all critical data
would go on ZFS.
> (3) Setup LVM on top of MD
> (4) Allocate PVs for zfs-fuse or ext3 as necessary, leaving room to
> grow. For example, from 1TB VG, allocate 100GB for zfs-fuse and 100GB
> for ext3. Leave 800GB untouched for now.
> (5) Create the zpool using only one LV as a vdev. After that, either :
> - turn off checksums, OR
> - leave zfs checksums on and set copies=2 for important data
> (6) If you need more space for zfs, create another LV, and add that LV
> as another vdev
> (7) If you need more space for ext3, grow the LV and use resize2fs
>
> This setup should give you most flexibility, and a balance between
> performance and data safety.
Hum. Downsides:
1. I can't read the disks on anything but a Linux system. If I want to
run FreeBSD or Solaris to get native ZFS, I'm out of luck.
2. RAID-5 write hole
I think I should slice each disk into two physical partitions and make a
ZFS pool out of a partition from each disk, so I can mount the pool from
some other OS. I'd use the other partitions with LVM. It would be
ideal if it were possible to grow these zfs partitions later, when I am
ready to toss out the LVM stuff, but I'm not sure if that's really
possible. Anybody know?
If yes, then you might need to accept the POSSIBILITY of raid-5 write
hole, or switch to a different server/OS combination. Note that
battery-backed cache on some storage controllers can help reduce the
possibility of the write hole.
If not, your original setup sounds good enough.
>> This setup should give you most flexibility, and a balance between
>> performance and data safety.
>>
>
>
> Hum. Downsides:
>
> 1. I can't read the disks on anything but a Linux system. If I want to
> run FreeBSD or Solaris to get native ZFS, I'm out of luck.
>
> 2. RAID-5 write hole
>
>
Correct. At this point I'd say you need to prioritize. What's on top of
your list? Is it performance? scalability? Data integrity?
> I think I should slice each disk into two physical partitions and make a
> ZFS pool out of a partition from each disk, so I can mount the pool from
> some other OS. I'd use the other partitions with LVM. It would be
> ideal if it were possible to grow these zfs partitions later, when I am
> ready to toss out the LVM stuff, but I'm not sure if that's really
> possible. Anybody know?
>
>
It's possible, as in you can use the space previously used by LVM as
vdevs (in stripe/raidz/raidz2, your choice) and use the vdev as :
- a new pool, or
- as an addition to grow your old pool
What you can't do is change the raidz configuration from (say) 8-vdev
raidz to 16-vdev raidz
Regards,
Fajar
No, but you can add a second 8-vdev raidz, keeping the same level of
redundancy. Switching to a 16-vdev raidz would lower the fault
tolerance of the array.
Regards,
Fajar
> David Abrahams wrote:
>> on Sat Jan 17 2009, "Fajar A. Nugraha" <fajar-AT-fajar.net> wrote:
>>
>>> Here's what I'd do :
>>> (1) Use hardware raid or Linux MD to provide redundancy
>>> Yes, it's possible to have raid-5 write hole. Having a good storage
>>> controller with battery-backed cache to handle raid helps prevent
>>> this.
>>>
>>
>> Why accept a raid-5 write hole for my critical data if I don't have to?
>>
>>
> Let's see :
> - Are you using a storage controller that can't present the real disk to
> the OS (some HP Proliant models come to mind)?
I don't think so.
> - Do you need more I/O throughput than what zfs-fuse raidz can give
> you?
I'd like that, but I don't need that part of the storage to be
ZFS-reliable.
> - Do you often need to change the disk allocation between ext3 and zfs?
Who uses ext3? ;-)
No, not yet.
> If yes, then you might need to accept the POSSIBILITY of raid-5 write
> hole, or switch to a different server/OS combination. Note that
> battery-backed cache on some storage controllers can help reduce the
> possibility of the write hole.
>
> If not, your original setup sounds good enough.
>
>>> This setup should give you most flexibility, and a balance between
>>> performance and data safety.
>>>
>>
>>
>> Hum. Downsides:
>>
>> 1. I can't read the disks on anything but a Linux system. If I want to
>> run FreeBSD or Solaris to get native ZFS, I'm out of luck.
>>
>> 2. RAID-5 write hole
>>
>>
> Correct. At this point I'd say you need to prioritize. What's on top of
> your list? Is it performance? scalability? Data integrity?
I hate to choose ;-)
>> I think I should slice each disk into two physical partitions and make a
>> ZFS pool out of a partition from each disk, so I can mount the pool from
>> some other OS. I'd use the other partitions with LVM. It would be
>> ideal if it were possible to grow these zfs partitions later, when I am
>> ready to toss out the LVM stuff, but I'm not sure if that's really
>> possible. Anybody know?
>>
> It's possible, as in you can use the space previously used by LVM as
> vdevs (in stripe/raidz/raidz2, your choice) and use the vdev as :
> - a new pool, or
> - as an addition to grow your old pool
I don't think using it to grow an old pool with copies=xxx would be a
good plan, because then copies are more likely to end up on the same
disk. I think that leaves adding a new raid-z pool. Still, I would
prefer to end up with whole-disk vdevs if I throw out the LVM parts.
> What you can't do is change the raidz configuration from (say) 8-vdev
> raidz to 16-vdev raidz
Yeah, that's one reason to go with copies=xxx redundancy instead.
>> At this point I'd say you need to prioritize. What's on top of
>> your list? Is it performance? scalability? Data integrity?
>>
>
> I hate to choose ;-)
>
>
Welcome to the real world :D
As a side note, an opensolaris advocate would probably say that :
- switching to Opensolaris, or
- using an external opensolaris-based storage server and export the
disks using iscsi or nfs
would get you all of those three.
>>> I think I should slice each disk into two physical partitions and make a
>>> ZFS pool out of a partition from each disk, so I can mount the pool from
>>> some other OS. I'd use the other partitions with LVM. It would be
>>> ideal if it were possible to grow these zfs partitions later, when I am
>>> ready to toss out the LVM stuff, but I'm not sure if that's really
>>> possible. Anybody know?
>>>
>>>
>> It's possible, as in you can use the space previously used by LVM as
>> vdevs (in stripe/raidz/raidz2, your choice) and use the vdev as :
>> - a new pool, or
>> - as an addition to grow your old pool
>>
>
> I don't think using it to grow an old pool with copies=xxx would be a
> good plan, because then copies are more likely to end up on the same
> disk.
Who said anything about copies=n?
Growing your old pool would mean that if you initially do
zpool create datapool raidz sdb2 sdc2 sdd2 sde2 sdf2
you can grow it later using
zpool add datapool raidz sdb1 sdc1 sdd1 sde1 sdf1
This is assuming sd[b-f]1 is previously used for linux partitions.
From zfs's point of view, the first and second raidz's redundancy would
be managed independently. So if (say) sdb is broken, two resilver
processes take place: one for the sdx2 series and one for the sdx1
series.
> I think that leaves adding a new raid-z pool. Still, I would
> prefer to end up with whole-disk vdevs if I throw out the LVM parts.
>
>
Yeah, that would be good. The problem is AFAICT growing/shrinking a
raidz vdev is not possible ATM. You'd need to export-import data. Again,
this depends on your priorities. If it were me, I'd live with
partition-vdevs instead of having to export-import several hundred GBs of
data.
Regards,
Fajar
> David Abrahams wrote:
>> on Sun Jan 18 2009, "Fajar A. Nugraha" <fajar-AT-fajar.net> wrote:
>>
>>
>>> - Do you often need to change the disk allocation between ext3 and zfs?
>>>
>>
>> Who uses ext3? ;-)
>>
> Okay, whatever non-zfs filesystem that you use then :)
>
>>> At this point I'd say you need to prioritize. What's on top of
>>> your list? Is it performance? scalability? Data integrity?
>>>
>>
>> I hate to choose ;-)
>>
>>
> Welcome to the real world :D
> As a side note, an opensolaris advocate would probably say that :
> - switching to Opensolaris, or
> - using an external opensolaris-based storage server and export the
> disks using iscsi or nfs
> would get you all of those three.
I know. Unfortunately there are a few other restrictions, like the fact
that I want to be able to virtualize FreeBSD on this machine, and FreeBSD
isn't a well-supported guest under VirtualBox. Also I have some
existing VMware VMs that need to work.
>>>> I think I should slice each disk into two physical partitions and make a
>>>> ZFS pool out of a partition from each disk, so I can mount the pool from
>>>> some other OS. I'd use the other partitions with LVM. It would be
>>>> ideal if it were possible to grow these zfs partitions later, when I am
>>>> ready to toss out the LVM stuff, but I'm not sure if that's really
>>>> possible. Anybody know?
>>>>
>>>>
>>> It's possible, as in you can use the space previously used by LVM as
>>> vdevs (in stripe/raidz/raidz2, your choice) and use the vdev as :
>>> - a new pool, or
>>> - as an addition to grow your old pool
You can add a new raidz2 to an existing pool? Awesome.
>> I don't think using it to grow an old pool with copies=xxx would be a
>> good plan, because then copies are more likely to end up on the same
>> disk.
>
> Who said anything about copies=n?
I did, remember?
> Growing your old pool would mean that if you initially do
>
> zpool create datapool raidz sdb2 sdc2 sdd2 sde2 sdf2
>
> you can grow it later using
>
> zpool add datapool raidz sdb1 sdc1 sdd1 sde1 sdf1
Yeah, I like that.
> This is assuming sd[b-f]1 is previously used for linux partitions.
> From zfs's point of view, the first and second raidz's redundancy would
> be managed independently. So if (say) sdb is broken, two resilver
> processes take place: one for the sdx2 series and one for the sdx1
> series.
That's tolerable I think.
>> I think that leaves adding a new raid-z pool. Still, I would
>> prefer to end up with whole-disk vdevs if I throw out the LVM parts.
>>
>>
> Yeah, that would be good. The problem is AFAICT growing/shrinking a
> raidz vdev is not possible ATM. You'd need to export-import data. Again,
> this depends on your priorities. If it were me, I'd live with
> partition-vdevs instead of having to export-import several hundred GBs of
> data.
Yeah, me too. So now I'm thinking:
+-----+-----+-----+-----+-----+-----+-----+-----+
| sda | sdb | sdc | sdd | sde | sdf | sdg | sdh |
+=====+=====+=====+=====+=====+=====+=====+=====+
|XXXXX|XXXXX|XXXXX|XXXXX|XXXXX|XXXXX|XXXXX|XXXXX|
+-----+-----+-----+-----+-----+-----+-----+-----+
|YYYYY|YYYYY|YYYYY|YYYYY|YYYYY|YYYYY|YYYYY|YYYYY|
+-----+-----+-----+-----+-----+-----+-----+-----+
|ZZZZZ|ZZZZZ|ZZZZZ|ZZZZZ|ZZZZZ|ZZZZZ|ZZZZZ|ZZZZZ|
+-----+-----+-----+-----+-----+-----+-----+-----+
| | | | | | | | |
. .
. Uncommitted space .
. .
| | | | | | | | |
+-----+-----+-----+-----+-----+-----+-----+-----+
Where:
XXXXs are a dmraid RAID5 containing Linux boot and root. I'm "counting
on" my UPS to mitigate the chance of disaster here.
YYYYs are a dmraid RAID0 for a volatile scratch area (e.g. object
files created when tests are built)
ZZZZs are RAIDZ2 for critical data.
Write cache enabled.
I can add more of X, Y, or Z in the uncommitted space as my needs
evolve. Probably by the time I need another slice of Zs, BTRFS will be
ready for action anyway ;-)
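One possible realization of that diagram, after partitioning each disk. This uses mdadm rather than dmraid, and assumes the X/Y/Z slices are partitions 1-3 on every disk; the pool name is made up:

```shell
# sd[a-h]1 (X): RAID5 for Linux boot/root
mdadm --create /dev/md0 --level=5 --raid-devices=8 /dev/sd[a-h]1
# sd[a-h]2 (Y): RAID0 scratch space, fast but unreliable
mdadm --create /dev/md1 --level=0 --raid-devices=8 /dev/sd[a-h]2
# sd[a-h]3 (Z): RAIDZ2 for the critical data
zpool create tank raidz2 /dev/sd[a-h]3
```

The remaining unpartitioned space can later become a fourth partition on each disk and be added to whichever layer runs out first.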