[zfs-fuse] raw or partitioned disks

unicron

unread,
Jan 6, 2009, 2:23:25 PM1/6/09
to zfs-fuse
Hi,

I've been experimenting with ZFS for some time now, using 3x 1TB disks
in a raidz config. Now I'm looking to create a more permanent pool,
which I'd like to grow in the future by swapping the 1TB disks for
larger ones.

My current pool uses raw disks (/dev/sdX), but I've read that pools
using raw disks cannot be imported on other OSes. Is that correct? That
seems illogical to me, because the ZFS implementations are the 'same'
on different OSes.

Another option would be to use partitioned disks (/dev/sdX1). I'd like
to know what the pros and cons of these two options are. Are both able
to increase the size of the pool by swapping disks?


P.S. Is it also true that I'm better off using /dev/disk/by-id instead
of /dev/sdX? I think this can only be applied when using raw disks.
Correct?

Greetings

Jonathan Schmidt

unread,
Jan 6, 2009, 3:18:27 PM1/6/09
to zfs-...@googlegroups.com
> Hi,
>
> I've been experimenting with ZFS for some time now using 3x 1TB disks
> in a raidZ config. Now, i'm looking to create a more permanent pool,
> which in the future i like to grow by swapping the 1TB for larger
> ones.
>
> My current pool uses raw disks (/dev/sdX), but I read that pools using
> raw disk cannot be imported on other OS's. correct? This seems not
> logical to me, because both zfs implementations are the 'same' on
> different OS's
>
> Another option would be to use partitioned disks (/dev/sdX1). I like
> to know what the pro's and con's of these two options are? Are both
> able to increase the size of the pool by swapping disks?

I don't know the answer to the above, sorry.

> P.S. is it also true that I can better use /dev/disk/by-id instead of /
> dev/sdX? I think this can only be applied when using raw disks.
> correct?

No, the /dev/disk/by-id directory contains symlinks to each partition of
the disks, along with the disks themselves. So /dev/sda might be
/dev/disk/by-id/xxxxx and /dev/sda1 would be /dev/disk/by-id/xxxxx-part1.
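A quick way to see this mapping on a Linux box (the ID below is a made-up example; substitute one from your own listing):

```shell
# List the stable by-id names and the kernel devices they link to
ls -l /dev/disk/by-id/

# Resolve one by-id symlink to its current kernel device node
# (hypothetical example ID -- use a real one from the listing above)
readlink -f /dev/disk/by-id/ata-EXAMPLE_SERIAL-part1
```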

Eric Krieger

unread,
Jan 6, 2009, 5:19:05 PM1/6/09
to zfs-...@googlegroups.com
On Tue, 2009-01-06 at 11:23 -0800, unicron wrote:
> Hi,
>
> I've been experimenting with ZFS for some time now using 3x 1TB disks
> in a raidZ config. Now, i'm looking to create a more permanent pool,
> which in the future i like to grow by swapping the 1TB for larger
> ones.
>
> My current pool uses raw disks (/dev/sdX), but I read that pools using
> raw disk cannot be imported on other OS's. correct? This seems not
> logical to me, because both zfs implementations are the 'same' on
> different OS's


That is 100% correct. If you do not partition your drives before adding
them to a zpool, you will not be able to import or export your zpools
correctly to another OS, or if you have to rebuild your current OS.

- Eric

----------------------------------------
Eric J. Krieger
Ubuntu Member and Altruistic Network Ninja.

Email: grammat...@ubuntu.com
Wiki: https://wiki.ubuntu.com/GrammatonCleric
NM LoCo: https://wiki.ubuntu.com/NewMexicoTeam


Random Quote:

"No matter what the game, no matter
what the rules, the same rules apply
to both sides!"
-Hoyle's Law

unicron

unread,
Jan 7, 2009, 2:00:01 PM1/7/09
to zfs-fuse


On 6 jan, 23:19, Eric Krieger <grammatoncle...@ubuntu.com> wrote:
> On Tue, 2009-01-06 at 11:23 -0800, unicron wrote:
> > Hi,
>
> > I've been experimenting with ZFS for some time now using 3x 1TB disks
> > in a raidZ config. Now, i'm looking to create a more permanent pool,
> > which in the future i like to grow by swapping the 1TB for larger
> > ones.
>
> > My current pool uses raw disks (/dev/sdX), but I read that pools using
> > raw disk cannot be imported on other OS's. correct? This seems not
> > logical to me, because both zfs implementations are the 'same' on
> > different OS's
>
> That is 100% correct.  If you do not partition your drives before adding
> them to a zpool you will not be able to import or export your zpools
> correctly to other os or if you have to rebuild your current os.
>

Okay, but why? Are raw zpools in Solaris different?

> - Eric
>
> ----------------------------------------
> Eric J. Krieger
> Ubuntu Member and Altruistic Network Ninja.
>
> Email: grammatoncle...@ubuntu.com
> Wiki:https://wiki.ubuntu.com/GrammatonCleric
> NM LoCo:https://wiki.ubuntu.com/NewMexicoTeam
>
> Random Quote:
>
> "No matter what the game, no matter
> what the rules, the same rules apply
> to both sides!"
>         -Hoyle's Law


> > P.S. is it also true that I can better use /dev/disk/by-id instead of /
> >dev/sdX? I think this can only be applied when using raw disks.
> > correct?


> No, the /dev/disk/by-id directory contains symlinks to each partition of
> the disks, along with the disks themselves. So /dev/sda might be
> /dev/disk/by-id/xxxxx and /dev/sda1 would be /dev/disk/by-id/xxxxx-part1.

So the best option is to use these symlinks, both with raw and
partitioned disks?

sghe...@hotmail.com

unread,
Jan 7, 2009, 2:14:03 PM1/7/09
to zfs-...@googlegroups.com
unicron wrote:
>> That is 100% correct. If you do not partition your drives before adding
>> them to a zpool you will not be able to import or export your zpools
>> correctly to other os or if you have to rebuild your current os.
>>
>>
> Okay, but why? Are raw zpool's in Solaris different?
>
>
Unicron, I think you are coming across as more stubborn than you intend
(:)), but right now I only feel the need to turn the question 180
degrees: why does it matter to you?

Are you giving a suggestion? Are you asking for documentation to be
updated? Are you questioning the advice you are getting? Are you asking
for new features? Are you simply curious?

As long as you just keep repeating 'but why', I think it might be a long
time before anyone finds the time to answer your potential question.
It is pretty clear that the answer isn't obvious (or you'd have 6 copies
of it by now), so... in case of curiosity: the source is all yours!
("Use the source, Luke!")

Now, enough of this - I'd probably dive into the code myself if I had
the time right now.

Seth

Jonathan Schmidt

unread,
Jan 7, 2009, 3:59:00 PM1/7/09
to zfs-...@googlegroups.com
>> > P.S. is it also true that I can better use /dev/disk/by-id instead of
>> > /dev/sdX? I think this can only be applied when using raw disks.

>> > correct?
>
>
>> No, the /dev/disk/by-id directory contains symlinks to each partition of
>> the disks, along with the disks themselves. So /dev/sda might be
>> /dev/disk/by-id/xxxxx and /dev/sda1 would be
>> /dev/disk/by-id/xxxxx-part1.
>
> So the best option is to use these symlinks, both with raw and
> partitioned disks?

Yep

David Abrahams

unread,
Jan 7, 2009, 6:32:05 PM1/7/09
to zfs-...@googlegroups.com

on Wed Jan 07 2009, "sgheeren-AT-hotmail.com" <sgheeren-AT-hotmail.com> wrote:

> unicron wrote:
>>> That is 100% correct. If you do not partition your drives before adding
>>> them to a zpool you will not be able to import or export your zpools
>>> correctly to other os or if you have to rebuild your current os.
>>>
>>>
>> Okay, but why? Are raw zpool's in Solaris different?
>>
>>
> Unicorn, I think you are coming across more stubborn than you intend
> (:)) but right now I only feel the need to redirect the question 180
> degrees. Why does it matter to you?

It matters to me because I want to understand the system that I'm using.

> Are you giving a suggestion? Are you asking for documentation to be
> updated? Are you questioning the advice you are getting? Are you
> asking for new features? Are you simply curious?

Well, I'd really like to know, too. It was always my assumption that if
I used raw vdevs I'd not only be following ZFS best practices, but I'd
also be able to use the pool under FreeBSD or (Open)Solaris when I
eventually got some more powerful hardware for hosting my VMs. This
whole thing is coming as a bit of a shock, and making me wonder if I
should be heading down a totally different road (btrfs, dmraid, or
something else).

> As long as you just keep repeating 'but why' I think it might be a
> long time before anyone will find the time to answer your potential
> question. It is pretty clear that the answer isn't obvious (or you'd
> have 6 copies of it by now) so ... in case of curiosity: the source is
> all yours! ("Use the sorce, Luke!")

Well, c'mon now. If the answer _really_ isn't obvious it's pretty
unlikely that the source is going to reveal it easily. Someone who
understands this issue ought to at least be able to provide some brief,
initially-incomprehensible explanation that we can chew on until we get
it.

--
Dave Abrahams
BoostPro Computing
http://www.boostpro.com

sghe...@hotmail.com

unread,
Jan 7, 2009, 6:42:18 PM1/7/09
to zfs-...@googlegroups.com

> Well, c'mon now. If the answer _really_ isn't obvious it's pretty
> unlikely that the source is going to reveal it easily. Someone who
> understands this issue ought to at least be able to provide some brief,
> initially-incomprehensible explanation that we can chew on until we get
> it.
>
I hear you. I hate it too :)

My guess is: there is one (lead) dev and he is currently busy (R.
Correia). He'll probably (*just guessing*) answer along the lines of:
it was easier to implement vdev management against the kernel API for
partitions than to meddle with raw block devices (and their locking),
at least from FUSE. I might be wrong, but such a priority call seems
very likely when you port a thing like this and first and foremost want
to get it running.

If we want to know... two options: wait for the answer from God in the
cloud (Ricardo, is that you?) or look at the source.
Nah, I don't think the source will be confusing. I just think no one
reads it. (Sorry devs: I'm not ignoring your efforts here! I'm simply
exaggerating the lack of resources in order to illustrate my theory.)
If you read it, chances are the answer will jump right out (perhaps as
an inline comment).

If this wasn't incomprehensible, then may I apologize by saying it
wasn't brief either, LOL.

Cheers

Seth

PS. Did I mention little of the above is based on facts, conversation,
prior or acquired knowledge? I am *just guessing*

unicron

unread,
Jan 8, 2009, 2:52:31 PM1/8/09
to zfs-fuse
Hi,

sorry for all the why's, but I'm basically curious and I'd like to
understand the filesystem I use (of course, up to a point). I
understand that it is a question that few people can answer, and I am
happy to wait a while.

I'd also like to thank R. Correia for his great work on making ZFS
available on Linux.

P.S. I'm currently using the 0.5.0 version from the ZFS-FUSE blog
page. I've seen some technical discussions in this Google group about
changes made since 0.5.0 was released. Where can I find the latest
repository, and is it recommended to use it (as in, does it fix
critical things when using raidz)?

David Abrahams

unread,
Jan 8, 2009, 3:22:31 PM1/8/09
to zfs-...@googlegroups.com

on Wed Jan 07 2009, "sgheeren-AT-hotmail.com" <sgheeren-AT-hotmail.com> wrote:

>> Well, c'mon now. If the answer _really_ isn't obvious it's pretty
>> unlikely that the source is going to reveal it easily. Someone who
>> understands this issue ought to at least be able to provide some brief,
>> initially-incomprehensible explanation that we can chew on until we get
>> it.
>>
> I hear you. I hate it too :)

I guess I'll probably try serving ZFS from a FreeBSD VM, then.

> My guess is: there is one (lead) dev and he is currently busy (R.
> Correia). He'll probably (*just guessing*) be answering along the lines
> of: it was easier to implement vdev management against the kernel API
> for partitions than meddling with raw block devices (and their locking),
> at least from fuse. I might be wrong, but it seems like such a priority
> call is very likely to happen when you wish to port a thing like this
> and first-and-foremost want to have it running?
>
> If we want to know... 2 options: wait for God from the cloud the answer
> (Riccardo, is that you?) or look at the source.
> Nah I don't think the source will be confusing. I just think no-one
> reads it. (Sorry devs: I'm not ignoring your efforts here! I'm simply
> exaggerating the lack of resources in order to illustrate my theory). If
> you read it, chances are it'll readily jump out (perhaps some form of
> comment inline).

Yeah, but sorry, I don't know the first thing about kernel APIs for
partitions /or/ meddling with raw block devices. I did take a crawl
through the source but came up empty.

> If this wasn't incomprehensible than may I apologize by saying it wasn't
> brief either LOL

Yeah, you failed on both counts ;-)

sghe...@hotmail.com

unread,
Jan 8, 2009, 4:15:58 PM1/8/09
to zfs-...@googlegroups.com
unicron wrote:
> P.S. I'm currently using the 0.5.0 version from the ZFS-FUSE blog
> webpage. I've seen some technical discussions in this google group
> about changes that are made since 0.5.0 was released. Where can I find
> the latest repository? and is recommended to use it (as in does it fix
> critical things when using raidz)?
>
hg clone http://www.wizy.org/mercurial/zfs-fuse/trunk

Changelog:

hg log

Or see CHANGES file
(http://www.wizy.org/mercurial/zfs-fuse/trunk/file/008c531499cd/CHANGES)

Uwe Kubosch

unread,
Jan 8, 2009, 6:08:59 PM1/8/09
to zfs-...@googlegroups.com
On Thu, 2009-01-08 at 11:52 -0800, unicron wrote:
> P.S. I'm currently using the 0.5.0 version from the ZFS-FUSE blog
> webpage. I've seen some technical discussions in this google group
> about changes that are made since 0.5.0 was released. Where can I find
> the latest repository? and is recommended to use it (as in does it fix
> critical things when using raidz)?

The latest changes are recommended, and they are in the latest RPM
available from the Fedora repository. I am maintaining this package
and would like as much feedback as possible, hence the shameless
plug :)

--
With kind regards,
Uwe Kubosch
Kubosch Consulting
Norway


unicron

unread,
Jan 9, 2009, 3:45:18 AM1/9/09
to zfs-fuse
Hi,

Yes, I read the thread on Fedora in this Google group. Unfortunately,
I'm running Ubuntu 8.04, not Fedora.

Fajar A. Nugraha

unread,
Jan 9, 2009, 4:46:05 AM1/9/09
to zfs-...@googlegroups.com
On Fri, Jan 9, 2009 at 3:45 PM, unicron <jeroen.d...@gmail.com> wrote:
>
> Hi,
>
> Yes, I read the thread on Fedora in this google groups. Unfortunately,
> I'm running Ubuntu 8.04 and not Fedora.

Since you're using Ubuntu, why not use Filip Brcic's ppa?
https://wiki.ubuntu.com/ZFS/
http://ppa.launchpad.net/brcha/ubuntu/

I believe it uses changeset 375 from trunk

Ricardo M. Correia

unread,
Jan 9, 2009, 1:21:02 PM1/9/09
to zfs-...@googlegroups.com
Hi,

Sorry for taking so long to answer, I've been a bit busy as you may have
guessed.

On Qui, 2009-01-08 at 11:52 -0800, unicron wrote:
> sorry for all the why's, but I'm basically curious and I like to
> understand the filesystem I use (ofcourse to a certain limit). I
> understand that it is a question that few people can answer and I am
> very happy to wait for a while.

So here's what happens:

When you create a ZFS pool on a whole disk in Solaris (or you add a new,
whole disk to an existing pool), the zpool command will use libdiskmgt
to create an EFI/GPT disk label on the disk with a single large
partition (and also a very small one, but this is irrelevant), and then
it will use this partition to store the data.

However, in Linux, this doesn't happen because there is no libdiskmgt
and it would be hard to port it, so zfs-fuse will simply treat a raw
disk like any other block device (i.e. it will just write the data to
it, it won't create a GPT label like in Solaris).

The problem with the latter is that Solaris doesn't like disks without
labels/partitions, so if you create your pool on a raw disk with
zfs-fuse, you won't be able to import it on Solaris.

I have no idea how FreeBSD handles pools on raw disks, though (but my
guess is that it would work).

My recommendation is to always create an msdos label on your disk,
because there are some EFI/GPT incompatibilities between Solaris and
Linux. You can do this with the 'fdisk' program.
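As a sketch (the device and pool names here are examples, and these commands destroy existing data, so double-check the device first), that might look like:

```shell
# Put an msdos (MBR) label with one full-size partition on each disk.
# /dev/sdb is an example device -- this wipes its partition table!
fdisk /dev/sdb
#   o  (create a new empty DOS partition table)
#   n  (new primary partition; accept the defaults to use the whole disk)
#   w  (write the label and exit)

# Then create the pool on the partitions rather than the raw disks
# ('tank' is an example pool name):
zpool create tank raidz /dev/sdb1 /dev/sdc1 /dev/sdd1
```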

BTW, I've worked on a patch to make zpool create EFI labels like in
Solaris, but with libparted instead of libdiskmgt.
If anyone is interested, it's here:
https://bugzilla.lustre.org/show_bug.cgi?id=14548

Unfortunately, a zpool command with this patch can't currently be
distributed because the libparted license (GPLv2) is incompatible with
the zpool license (CDDL).

> P.S. I'm currently using the 0.5.0 version from the ZFS-FUSE blog
> webpage. I've seen some technical discussions in this google group
> about changes that are made since 0.5.0 was released. Where can I find
> the latest repository? and is recommended to use it (as in does it fix
> critical things when using raidz)?

You can find a pointer to the Mercurial repository in the main web page:
http://www.wizy.org/wiki/ZFS_on_FUSE

There have been some fixes, one of which related to SCSI devices that
might be important to avoid corruption.

You can find the changelog in the repository by going to the link below,
and then clicking on 'files' and then 'CHANGES':

http://www.wizy.org/mercurial/zfs-fuse/trunk/

HTH,
Ricardo


David Abrahams

unread,
Jan 9, 2009, 2:13:09 PM1/9/09
to zfs-...@googlegroups.com

on Fri Jan 09 2009, "Ricardo M. Correia" <Ricardo.M.Correia-AT-Sun.COM> wrote:

> Hi,
>
> Sorry for taking so long to answer, I've been a bit busy as you may have
> guessed.

Thanks so much for following up.

> On Qui, 2009-01-08 at 11:52 -0800, unicron wrote:
>> sorry for all the why's, but I'm basically curious and I like to
>> understand the filesystem I use (ofcourse to a certain limit). I
>> understand that it is a question that few people can answer and I am
>> very happy to wait for a while.
>
> So here's what happens:
>
> When you create a ZFS pool on a whole disk in Solaris (or you add a new,
> whole disk to an existing pool), the zpool command will use libdiskmgt
> to create an EFI/GPT disk label on the disk with a single large
> partition (and also a very small one, but this is irrelevant), and then
> it will use this partition to store the data.

So, IIUC, you are saying that even though the official ZFS
recommendation is to use whole disks, when you do that, Solaris actually
partitions the disk under the covers anyway?

If so, why the emphasis on using whole disks? If not... I'm
really confused!

> However, in Linux, this doesn't happen because there is no libdiskmgt
> and it would be hard to port it, so zfs-fuse will simply treat a raw
> disk like any other block device (i.e. it will just write the data to
> it, it won't create a GPT label like in Solaris).
>
> The problem with the latter, is that Solaris doesn't like disks without
> labels/partitions, so if you create your pool on a raw disk with
> zfs-fuse, you won't be able to import it on Solaris.
>
> I have no idea how FreeBSD handles pools on raw disks, though (but my
> guess is that it would work).

You mean, you guess it would work without the label?

> My recommendation is to always create an msdos label on your disk,
> because there are some EFI/GPT incompatibilities between Solaris and
> Linux. You can do this with the 'fdisk' program.

So... Solaris can read msdos-labelled disks?

Ricardo M. Correia

unread,
Jan 9, 2009, 2:38:00 PM1/9/09
to zfs-...@googlegroups.com
On Sex, 2009-01-09 at 14:13 -0500, David Abrahams wrote:
> > When you create a ZFS pool on a whole disk in Solaris (or you add a new,
> > whole disk to an existing pool), the zpool command will use libdiskmgt
> > to create an EFI/GPT disk label on the disk with a single large
> > partition (and also a very small one, but this is irrelevant), and then
> > it will use this partition to store the data.
>
> So, IIUC, you are saying that even though the official ZFS
> recommendation is to use whole disks, when you do that, Solaris actually
> partitions the disk under the covers anyway?

Yes.

> If so, why the emphasis on using whole disks? If not... I'm
> really confused!

Because when you use a whole disk in Solaris, ZFS will automatically
enable the disk write cache, because it knows that there is no other
filesystem there that could become damaged if the power fails (such as
the UFS filesystem).

So performance will be better and your disks will last longer.
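On Linux you can inspect and toggle a drive's write cache by hand, e.g. with hdparm (/dev/sdb is an example device):

```shell
# Query the current write-cache setting of the drive
hdparm -W /dev/sdb

# Turn the write cache off (safer when other filesystems share the disk)
hdparm -W0 /dev/sdb

# Turn it back on
hdparm -W1 /dev/sdb
```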

> >> I have no idea how FreeBSD handles pools on raw disks, though (but my
> > guess is that it would work).
>
> You mean, you guess it would work without the label?

Yes, but it's only a guess.

> > My recommendation is to always create an msdos label on your disk,
> > because there are some EFI/GPT incompatibilities between Solaris and
> > Linux. You can do this with the 'fdisk' program.
>
> So... Solaris can read msdos-labelled disks?

Yes.

Cheers,
Ricardo

sghe...@hotmail.com

unread,
Jan 9, 2009, 2:45:45 PM1/9/09
to zfs-...@googlegroups.com
May I say that it is great to have you here, Ricardo! Someone with
prompt, direct, and proper answers. A big help!

Ricardo M. Correia wrote:
> On Sex, 2009-01-09 at 14:13 -0500, David Abrahams wrote:
>
>>> When you create a ZFS pool on a whole disk in Solaris (or you add a new,
>>>

And I love the way your (Portuguese?) mailer wants to write 'On Sex'
all the time :) I suppose that is only funny on my side of the globe...

Regards,
Seth

David Abrahams

unread,
Jan 9, 2009, 2:57:13 PM1/9/09
to zfs-...@googlegroups.com

on Fri Jan 09 2009, "Ricardo M. Correia" <Ricardo.M.Correia-AT-Sun.COM> wrote:

> On Sex, 2009-01-09 at 14:13 -0500, David Abrahams wrote:
>> > When you create a ZFS pool on a whole disk in Solaris (or you add a new,
>> > whole disk to an existing pool), the zpool command will use libdiskmgt
>> > to create an EFI/GPT disk label on the disk with a single large
>> > partition (and also a very small one, but this is irrelevant), and then
>> > it will use this partition to store the data.
>>
>> So, IIUC, you are saying that even though the official ZFS
>> recommendation is to use whole disks, when you do that, Solaris actually
>> partitions the disk under the covers anyway?
>
> Yes.
>
>> If so, why the emphasis on using whole disks? If not... I'm
>> really confused!
>
> Because when you use a whole disk in Solaris, ZFS will automatically
> enable the disk write cache, because it knows that there is no other
> filesystem there that could become damaged if the power fails (such as
> the UFS filesystem).
>
> So performance will be better and your disks will last longer.

Okay, this is good news; I feel better about the partitioning
requirement now. To me it sounds like zfs-fuse should perhaps refuse to
use whole disks in the Linux way, or at least give a loud warning.

Two tangential questions:

* Do I get control over the disk write cache in Linux?

* If I have battery backup for the whole server, should I have to worry
about enabling the disk write cache even if I share disks between zfs
and some other FS?

>> >> I have no idea how FreeBSD handles pools on raw disks, though (but my
>> > guess is that it would work).
>>
>> You mean, you guess it would work without the label?
>
> Yes, but it's only a guess.

Roger.

>> > My recommendation is to always create an msdos label on your disk,
>> > because there are some EFI/GPT incompatibilities between Solaris and
>> > Linux. You can do this with the 'fdisk' program.
>>
>> So... Solaris can read msdos-labelled disks?
>
> Yes.

Thanks, that's a big help.

Joseph Mulloy

unread,
Jan 9, 2009, 3:04:36 PM1/9/09
to zfs-...@googlegroups.com
On Fri, Jan 9, 2009 at 2:57 PM, David Abrahams <da...@boostpro.com> wrote:

* If I have battery backup for the whole server, should I have to worry
 about enabling the disk write cache even if I share disks between zfs
 and some other FS?


The risk with write caching on the disk should be the same for non-ZFS filesystems as it would be if the disk didn't also contain a ZFS filesystem. Unless of course ZFS-Fuse does something really funky, but I doubt it.

Joseph Mulloy

jdmu...@gmail.com

sghe...@hotmail.com

unread,
Jan 9, 2009, 3:15:12 PM1/9/09
to zfs-...@googlegroups.com

> The risk with write caching on the disk should be the same for non-ZFS
> filesystems as it would be if the disk didn't also contain a ZFS
> filesystem. Unless of course ZFS-Fuse does something really funky, but
> I doubt it.
Don't! Read the docs for ZFS. You'll see that both metadata and object
data are truly atomically updated. This is what they mean when they brag
that it's 'actually cheaper to take a snapshot than not to'.

Just my 20 cents

Seth

David Abrahams

unread,
Jan 9, 2009, 3:26:30 PM1/9/09
to zfs-...@googlegroups.com

Is your point that the atomic updates will no longer be atomic if
there's another filesystem?

sghe...@hotmail.com

unread,
Jan 9, 2009, 3:35:16 PM1/9/09
to zfs-...@googlegroups.com
Dave wrote:
> Is your point that the atomic updates will no longer be atomic if
> there's another filesystem
Not at all.

The point is that enabling write caching affects *all* disk accesses,
including any other mounted filesystems (or even raw block access, for
that matter). Other FSes might rely on write ordering and the like in
order to stay consistent. They won't be happy if their 'sibling'
filesystem chooses to enable dangerous features on the disk they are
sharing.

Think: good filesystem neighborship

sghe...@hotmail.com

unread,
Jan 9, 2009, 3:37:50 PM1/9/09
to zfs-...@googlegroups.com
Or perhaps clearer: not at all. Disabling write caching is always safer
(regardless of the fs in use).
It's just that ZFS can tolerate cache loss on power failure (and most
other failure modes, for that matter) a lot better than your ordinary
Linux filesystem.

Joseph Mulloy

unread,
Jan 9, 2009, 3:44:38 PM1/9/09
to zfs-...@googlegroups.com
My point was that having ZFS also on the disk makes no difference to the other filesystem as far as how well it can tolerate an unclean shutdown. Does zfs-fuse mess with the cache settings? Newer Linux filesystems use barriers to make write caching "safer" for metadata. All filesystems on a disk are subject to the same cache setting, but a filesystem shouldn't be affected by any of the other filesystems on the disk. The only issue would be if zfs-fuse is messing with the cache settings, which I think should be left to the administrator/user.

Joseph Mulloy

jdmu...@gmail.com

David Abrahams

unread,
Jan 9, 2009, 4:06:28 PM1/9/09
to zfs-...@googlegroups.com

on Fri Jan 09 2009, "sgheeren-AT-hotmail.com" <sgheeren-AT-hotmail.com> wrote:

> Dave wrote:
>> Is your point that the atomic updates will no longer be atomic if
>> there's another filesystem
>
> Not at all.
>
> The point is that enabling write-caching affects *all* disk accesses, so
> including any other mounted filesystems (or even raw block access for
> that matter).

Sure.

> Other FS-es might rely on write ordering and stuff like that in order
> to stay consistent.

Do you mean to tell me that write caching itself is, de rigueur, not
compatible with the requirements of ordinary filesystems like XFS?

> They won't be happy if their 'sibling' filesystem chose to enable
> dangerous features on the disk they are sharing.
>
> Think: good filesystem neighborship

AFAICT write caching with XFS is OK as long as write barriers are enabled:
http://tinyurl.com/xfsfaq
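For instance (the mount point is an example; XFS has barriers on by default in recent kernels, so mostly you just need to make sure 'nobarrier' isn't set):

```shell
# Check whether an XFS filesystem was mounted with barriers disabled
mount | grep xfs

# If 'nobarrier' shows up in the options, remount with barriers enabled
mount -o remount,barrier /mnt/data
```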

David Abrahams

unread,
Jan 9, 2009, 4:07:35 PM1/9/09
to zfs-...@googlegroups.com

on Fri Jan 09 2009, "sgheeren-AT-hotmail.com" <sgheeren-AT-hotmail.com> wrote:

Yeah, but as I said, I have battery backup on the whole system.
Shouldn't that be enough to eliminate the power failure worries?

Omen Wild

unread,
Jan 9, 2009, 4:37:06 PM1/9/09
to zfs-...@googlegroups.com
Quoting David Abrahams <da...@boostpro.com> on Fri, Jan 09 16:07:
>
> Yeah, but as I said, I have battery backup on the whole system.
> Shouldn't that be enough to eliminate the power failure worries?

That does not cover crashes. Same impact, system goes down without
finishing disk activity.

--
Blessed are the young for they shall inherit the national debt.

Jonathan Schmidt

unread,
Jan 9, 2009, 4:48:33 PM1/9/09
to zfs-...@googlegroups.com
>
> Quoting David Abrahams <da...@boostpro.com> on Fri, Jan 09 16:07:
>>
>> Yeah, but as I said, I have battery backup on the whole system.
>> Shouldn't that be enough to eliminate the power failure worries?
>
> That does not cover crashes. Same impact, system goes down without
> finishing disk activity.

Do you actually trust your battery backup? I've only had moderate
success with them. They are typically poorly maintained and
infrequently (or never) tested. I still use one, of course; I just
don't trust it :)

Joseph Mulloy

unread,
Jan 9, 2009, 4:50:51 PM1/9/09
to zfs-...@googlegroups.com
For all practical purposes you should be fine. What does the system do, anyway? Is it a very important server, a home desktop, or something in between? Even if you don't consider OS crashes, the filesystem could have a bug. You're never going to get a 100% guarantee of data integrity. I think the checksumming in ZFS is more important than the atomic writes; as far as I know, the type of corruption that can be caught and fixed by ZFS checksums happens more frequently than crashes.

Joseph Mulloy

jdmu...@gmail.com

David Abrahams

unread,
Jan 9, 2009, 5:24:45 PM1/9/09
to zfs-...@googlegroups.com

I don't know whether I trust it, which is even worse ;-)

David Abrahams

unread,
Jan 9, 2009, 5:27:21 PM1/9/09
to zfs-...@googlegroups.com

on Fri Jan 09 2009, "Joseph Mulloy" <jdmulloy-AT-gmail.com> wrote:

> For all practical purposes you should be fine. What does the system do any
> way? Is it a very important server, a home desktop or something in
> between?

I hope to make it very important. I hope to keep all the really
important stuff in ZFS, though, so if a crash or power outage corrupts
the XFS, I'm not too worried -- I'll have a backup of it in the ZFS FS
anyway.

> Even if you don't consider OS crashes, the filesystem could have a bug.
> You're never going to get a 100% guarantee of data integrity. I think the
> checksumming in ZFS is more important than the atomic writes, as far as I
> know the type of corruption that can be caught and fixed by ZFS checksums
> happens more frequently than crashes.

Interesting.

Thanks,

sghe...@hotmail.com

unread,
Jan 9, 2009, 6:03:46 PM1/9/09
to zfs-...@googlegroups.com

>> Even if you don't consider OS crashes, the filesystem could have a bug.
>> You're never going to get a 100% guarantee of data integrity. I think the
>> checksumming in ZFS is more important than the atomic writes, as far as I
>> know the type of corruption that can be caught and fixed by ZFS checksums
>> happens more frequently than crashes.
>>
>
> Interesting.
>
Interesting, because paradoxical: "Even the checksumming could have a bug" ...

By the way: checksumming without metadata integrity ('atomic writes')
yields nothing but a neatly checksummed pile of rubble, if you ask me.
It's not the one without the other.

Also: a 100% guarantee is not what we are after [1]. We have off-site
backups to save us if the O-bomb hits our house. Presumably, the
off-site backup is properly checksummed and trustworthy :)

With many mainstream filesystems (at least in some common performance
modes, like ext3's 'ordered' mode, if I remember correctly), enabling
write caching will prevent the fs driver from doing a reliable (journal)
recovery in the face of a simple power failure [2] (without considering
'complications' like hardware damage). And that recovery is what most
people 'require' for everyday file storage.

Seth


[1] Once entropy reaches infinity, data can be proven to be impossible :)
[2] I would go out on a limb and conjecture that simple power
failures/hard shut-offs happen a lot more than crashes and 'silent data
corruption' (aka bit-rot) combined :)

PS: to the person who expressed skepticism about the use of backup power:
Don't forget this should only be enough power to complete the writing of
unflushed buffers/caches in the controller/kernel pipelines. That would
normally take seconds, not minutes. An extreme case would be delayed
writes to, e.g., a 4 GB USB stick with more than 4 GB buffered. You can
see that take on the order of 30 seconds once you issue 'sync' or
'umount', mostly because of the inferior speed of flash storage.

Jonathan Schmidt

unread,
Jan 9, 2009, 6:25:41 PM1/9/09
to zfs-...@googlegroups.com
> PS: to the person who expressed skepticism about the use of backup power:
> Don't forget this should only be enough power to complete the writing of
> unflushed buffers/caches in the controller/kernel pipelines. That would
> normally take seconds, not minutes. An extreme case would be delayed
> writes to, e.g., a 4 GB USB stick with more than 4 GB buffered. You can
> see that take on the order of 30 seconds once you issue 'sync' or
> 'umount', mostly because of the inferior speed of flash storage.

You are implicitly trusting that the computer will properly receive
notification that the UPS has lost mains power and shutdown is imminent.
It's just as bad to have a UPS that runs your system for 25 minutes after
a power loss, but when it finally dies the computer loses power without
even knowing something was wrong.

It's ironic that my city just had a momentary power hiccup, and that one
of my three UPS battery backups did not survive, and the computer attached
to it was powered off immediately. 66% success rate in a single trial
isn't good. Now I have to go replace that UPS -- it won't even turn on
anymore :(

Also, if you can find me a 4GB USB stick that can write 4G of data in 30
seconds, I'll buy it off of you. For starters, it would need multiple
USB2.0 links (or USB3?).
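A back-of-envelope check (my own numbers, not from the thread: practical sustained USB 2.0 throughput is roughly 35 MB/s) shows why a single link can't do it:

```python
# Can a USB stick write 4 GiB in 30 seconds over one USB 2.0 link?
GIB = 1024 ** 3

data_bytes = 4 * GIB
target_seconds = 30

required_mb_s = data_bytes / target_seconds / 1e6  # ~143 MB/s needed
usb2_practical_mb_s = 35.0                         # rough sustained rate (assumption)

print(f"required: {required_mb_s:.0f} MB/s")
print("fits on one USB 2.0 link:", required_mb_s <= usb2_practical_mb_s)
```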

My point was simply: How often do you test your power loss contingency
setup? How confident are you that your PC will do the right thing
(unattended) if mains power is lost?

Joseph Mulloy

unread,
Jan 9, 2009, 6:42:13 PM1/9/09
to zfs-...@googlegroups.com
On Fri, Jan 9, 2009 at 6:25 PM, Jonathan Schmidt <j...@jschmidt.ca> wrote:

> PS: to the person who expressed skepticism about the use of backup power:
> Don't forget this should only be enough power to complete the writing of
> unflushed buffers/caches in the controller/kernel pipelines. That would
> normally take seconds, not minutes. An extreme case would be delayed
> writes to, e.g., a 4 GB USB stick with more than 4 GB buffered. You can
> see that take on the order of 30 seconds once you issue 'sync' or
> 'umount', mostly because of the inferior speed of flash storage.

As long as you turn off / flush the cache as soon as you lose power instead of waiting for the flush to occur during the OS shutdown.

You are implicitly trusting that the computer will properly receive
notification that the UPS has lost mains power and shutdown is imminent.
It's just as bad to have a UPS that runs your system for 25 minutes after
a power loss, but when it finally dies the computer loses power without
even knowing something was wrong.


It's not really that hard to set up apcupsd.
 
It's ironic that my city just had a momentary power hiccup, and that one
of my three UPS battery backups did not survive, and the computer attached
to it was powered off immediately.  66% success rate in a single trial
isn't good.  Now I have to go replace that UPS -- it won't even turn on
anymore :(

Also, if you can find me a 4GB USB stick that can write 4G of data in 30
seconds, I'll buy it off of you.  For starters, it would need multiple
USB2.0 links (or USB3?).

My point was simply:  How often do you test your power loss contingency
setup?  How confident are you that your PC will do the right thing
(unattended) if mains power is lost?

1. Install apcupsd
2. Configure
3. Pull the plug of the UPS
4. See if it works

Not that hard.
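For what it's worth, the configure step is small. A minimal sketch (the directive names are real apcupsd.conf settings; the values are examples for a USB-attached APC unit, not taken from this thread):

```shell
# /etc/apcupsd/apcupsd.conf -- example values for a USB-attached APC UPS
#   UPSCABLE usb
#   UPSTYPE usb
#   DEVICE            # left empty: autodetect the USB UPS
#   BATTERYLEVEL 5    # shut down at 5% battery remaining...
#   MINUTES 3         # ...or at 3 minutes of estimated runtime left

# Sanity-check that the daemon talks to the UPS before step 3 (pull the plug):
apcaccess status    # STATUS should read ONLINE while on mains power
```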

I've been wanting to use ZFS, but my home server is currently running Linux. My SATA card (Promise TX4) is not at all supported by OpenSolaris, and things like my shorewall config can't be carried over since it's specific to netfilter. I know ZFS-Fuse isn't yet tuned for performance, but unless the performance really sucks, I doubt it would matter to me. What I want to know is how safe it is: is data corruption a regular occurrence, or can it be trusted? I've also thought about FreeBSD, but according to their wiki ZFS causes kernel panics, so I'd rather stay away from that. I'm currently using ReiserFS 3 and it's performed well for me over the years.

Is ZFS-Fuse stable enough that its advantages, such as checksumming, outweigh its risks compared with ReiserFS 3? If ZFS-Fuse is sufficiently stable, the checksumming and atomic writes could possibly make it less risky than ReiserFS.

Joseph Mulloy

jdmu...@gmail.com

Chris Samuel

unread,
Jan 10, 2009, 3:12:50 AM1/10/09
to zfs-...@googlegroups.com
On Sat, 10 Jan 2009 7:44:38 am Joseph Mulloy wrote:

> Newer Linux filesystems use barriers to make write caching "safer" for
> metadata.

*Only* if those underlying devices support it *and* as long as you're not
using the device-mapper layer.

http://hightechsorcery.com/2008/06/linux-write-barriers-write-caching-lvm-and-filesystems

cheers,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic.
For more info see: http://en.wikipedia.org/wiki/OpenPGP

signature.asc

Joseph Mulloy

unread,
Jan 10, 2009, 3:40:08 AM1/10/09
to zfs-...@googlegroups.com
Sorry, forgot to mention the device-mapper thing. I use LVM and md-raid on my home server, so the write barriers don't do me any good. Another advantage for ZFS-Fuse: you get RAID, volume management and atomic writes. Is anyone here using it for important data? I suppose I could try it and have a nightly tar backup to a reiserfs filesystem just in case. Do the NFS and CIFS sharing through the ZFS commands work, or am I better off just configuring in the standard config files?

Joseph Mulloy

jdmu...@gmail.com


On Sat, Jan 10, 2009 at 3:12 AM, Chris Samuel <ch...@csamuel.org> wrote:
On Sat, 10 Jan 2009 7:44:38 am Joseph Mulloy wrote:

> Newer Linux filesystems use barriers to make write caching "safer" for
> metadata.

sghe...@hotmail.com

unread,
Jan 10, 2009, 5:35:14 AM1/10/09
to zfs-...@googlegroups.com
Jonathan:

> My point was simply: How often do you test your power loss contingency
> setup? How confident are you that your PC will do the right thing
> (unattended) if mains power is lost?
>
OK, that is a very valid criticism and very solid advice. It is in fact
the reason I didn't bother with a UPS in the end (though I did a lot of
research on it; I found it tough-ish to find a simple yet effective UPS
for use with multiple computers, both Windows and Linux).

I *did* find, however, that any decent (non-too-cheapo) UPS will have a
test function, so you can at least test the triggering (software) setup.

I decided at that moment in time that a UPS was overkill/unreliable for my
home use. I went with rigorous on- and off-site backups. It's easier to
do, easier to monitor (you can see that it *still* works on a day-to-day
basis), and it puts me in control (I can keep things as simple as I wish,
the only complexity being the crypto keys that are in the safe in...
PRINT!). As long as I keep a working knowledge of current hardware and
OSes I'll be able to simply restore at the cost of just bandwidth.

I could still be screwed of course, if a global war takes out both my
home and the internet. But hey, I imagine I have other things to worry
about then.

In my opinion, UPS-es are never meant to sustain power across grid
failures. Instead, the small-business/consumer UPS-es seem to focus on
notification and clean shutdown (system preservation). Only bigtime HA
providers, medical or military facilities will usually have proper
emergency generators. In fact I think that most of these institutions
have learned from past experience to do the same:
1. notify
2. power down
3. power up generators
4. controlled one-by-one restart of vital systems.


Cheers,
Seth

sghe...@hotmail.com

unread,
Jan 10, 2009, 5:41:43 AM1/10/09
to zfs-...@googlegroups.com
You have me at a loss when you intend to move to ZFS for day-to-day and
then Reiser (?????!) as a backup?!

That should be wholly reversed for a number of reasons.

Reiser is that thing that performs well and is flexible, but has a proven
history of being very easy to mess up, especially if you need to recover
anything. In general the on-disk format is so 'generic' that many things
can be mistaken for a Reiser filesystem image. (Never try to recover/fix
a Reiser fs on a disk that contains, e.g., a tarred Reiser image as a
file...) Just google around. Ext3 has a much better recoverability track
record.

ZFS (at least zfs-fuse) has a lower performance, a big CPU footprint on
read/write, but excellent (triple-A) data and meta-data consistency.

I use XFS for day-to-day use, local backups on ZFS (because of the data
checksumming).

Thoughts?

Joseph Mulloy wrote:
> Sorry forgot to mention the device-mapper thing. I use LVM and md-raid
> on my home server so the write barriers don't do me any good. Another
> advantage for ZFS-Fuse, you get RAID, Volume Management and atomic
> writes. Is anyone here using it for important data? I suppose I could
> try it and have a nightly tar backup to a reiserfs filesystem just in
> case. Do the NFS and CIFS sharing through the ZFS commands work or am
> I better off just configuring in the standard config files?
>
> Joseph Mulloy
>
> jdmu...@gmail.com <mailto:jdmu...@gmail.com>

unicron

unread,
Jan 10, 2009, 6:51:48 AM1/10/09
to zfs-fuse
Hi all,

Thanks for the answer Ricardo! It is now clear to me why partitions
are the way to go.

I'm currently backing up my 3x1TB software raid5 and my 3x1TB raidz pool
to 6x USB disks in a raidz pool. When this is completed I will use the
6x1TB disks to create a raidz pool on them. This gives me an extra 1TB,
because I will then have only 1 disk for parity instead of 2.

I've also been looking at the CPU usage while making the backup, since
it is said ZFS is CPU intensive. When running top on my Ubuntu server
I see that the memory usage (RES) is about 250m and the CPU usage
fluctuates: it seems to use a single CPU core at ~90% for about
2 seconds, and then for 2 seconds it's almost idle. This pattern keeps
repeating. So it seems zfs-fuse only uses a single core, or perhaps
the USB disks are the bottleneck that prevents zfs-fuse from needing
the second core to keep up.

On 9 jan, 19:21, "Ricardo M. Correia" <Ricardo.M.Corr...@Sun.COM>
wrote:

Joseph Mulloy

unread,
Jan 10, 2009, 5:48:34 PM1/10/09
to zfs-...@googlegroups.com
On Sat, Jan 10, 2009 at 5:41 AM, sghe...@hotmail.com <sghe...@hotmail.com> wrote:

You have me at a loss when you intend to move to ZFS for day-to-day and
then Reiser (?????!) as a backup?!

That should be wholly reversed for a number of reasons.

Reiser is that thing that performs well, is flexible but has a proven
history of being very easy to mess-up especially if you need to recover
anything. In general the ODF is so 'generic' that many things could be
mistaken for a Reiser filesystem image. (Never try to recover/fix a
reiser fs on a disk that contains, e.g. a tarred Reiser image as a
file...). Just google around. Ext3 has much better recoverability track
records.

Never had any problems with it. I suppose I could use ext3 for a backup since performance isn't critical.
 

ZFS (at least zfs-fuse) has a lower performance, a big CPU footprint on
read/write, but excellent (triple-A) data and meta-data consistency.

I use XFS for day-to-day use, local backups on ZFS (because of the data
checksumming).

This won't protect you from silent data corruption on the XFS filesystem: if a file becomes corrupt, then the backup copy that lives on ZFS will become corrupt too. Over time, if you have to delete old backups to reclaim disk space, the backups will all be corrupt. ZFS will, however, protect the data at rest on the backup.

I've been looking at a lot of options, such as running Debian in a Xen Dom0 and OpenSolaris in a DomU. I got it running on my laptop, but from what I've been reading there are performance issues; then again, my network doesn't really require that much performance. It also looks like the Linux world is switching from Xen to KVM, but my server's VIA C7 doesn't have virtualization instructions.

I'm starting to lean towards a two-server setup. I have a 333 MHz Celeron with 256 MB RAM and various IDE disks I could use for the firewalling and network functions, and I could install OpenSolaris on the Everex GPC that I'm currently using as a server and use it as just a fileserver. The tricky thing would be my APC UPS, but apcupsd has the capability to control multiple machines on the same UPS over the network. The only thing I'd be worried about is that my UPS is a base model with a small battery; I get about 10 minutes (it's not completely dead when the software shuts it down) with just the GPC, the network switch and the modem hooked up.

The SATA card I bought won't work with Solaris, but the board has two 1.5 Gb/s ports that are supported by the IDE driver. I won't get hot-insert support, but with RAID the machine can stay up until I can shut it down to replace the disk, which I would have to buy anyway.
 

Joseph Mulloy

jdmu...@gmail.com

sghe...@hotmail.com

unread,
Jan 10, 2009, 6:06:22 PM1/10/09
to zfs-...@googlegroups.com
Wow Joseph, I'm really jealous of that setup you describe.

I don't have the means to put all that machinery up. I wouldn't want to,
actually. As it happens I'm running my server business on a AMD Geode
board (silent, low power) and my workstation (to my horror) doesn't like
Solaris. So...

Of course, having checksummed data all the way beats anything less :) Rock
beats scissors, or sumtin' like that[1]

I solve it by keeping very longterm snapshots. That way if anything got
corrupted I imagine I could take the loss and revert to a reasonably old
snapshot version of that specific file.

Also, I must admit to gambling on the (admittedly subjective) observation
that "I never witnessed the effect(s) of silent data corruption"[2]... and I
know that would just bite me in the end, which is *why* I do use ZFS.

Think about it: silent data corruption never manifested itself because
1. it is silent (QED)
2. my data never lived long enough (I just threw away my college
writings, early software dev, etc.)
3. I won't notice wrong bits in my jpgs
4. I never use the data, but would be disappointed to find it unusable
once I wanted to
5. a lot of the data used to be my OS and development stuff; this got
recycled about bi-yearly anyway
6. data got replicated from hard disk to hard disk before the disks grew old
7. hard disks at that time were less vulnerable (fewer GB/square inch,
fewer GB/$); especially the capacity thing might hurt in the long run.
Data storage is still a magnetism game, after all



[1] I'm Dutch. I wouldn't know, really:)
[2] that includes monitoring zfs stats since I began using it

David Abrahams

unread,
Jan 10, 2009, 8:04:12 PM1/10/09
to zfs-...@googlegroups.com

on Sat Jan 10 2009, Chris Samuel <chris-AT-csamuel.org> wrote:

> On Sat, 10 Jan 2009 7:44:38 am Joseph Mulloy wrote:
>
>> Newer Linux filesystems use barriers to make write caching "safer" for
>> metadata.
>
> *Only* if those underlying devices support it *and* as long as you're not
> using the device-mapper layer.

Nooooooooooooooo! My beloved LVM makes things less safe?

> http://hightechsorcery.com/2008/06/linux-write-barriers-write-caching-lvm-and-filesystems

/me gnashes teeth

Looks like EVMS is no better in this regard, being built on
device-mapper MD. :(

sghe...@hotmail.com

unread,
Jan 10, 2009, 8:09:41 PM1/10/09
to zfs-...@googlegroups.com

> Nooooooooooooooo! My beloved LVM makes things less safe?
>
> [...]
> /me gnashes teeth
>
EVMS is no better (and on the way out). I had the initial shock of
discovering this when I researched the following lines from
/var/log/messages

kernel: [ 16.393477] SGI XFS with ACLs, security attributes, realtime,
large block numbers, no debug enabled
kernel: [ 16.393933] SGI XFS Quota Management subsystem
kernel: [ 16.400657] Filesystem "dm-1": Disabling barriers, trial
barrier write failed
kernel: [ 16.407272] XFS mounting filesystem dm-1

David Abrahams

unread,
Jan 10, 2009, 8:12:43 PM1/10/09
to zfs-...@googlegroups.com

on Sat Jan 10 2009, "Joseph Mulloy" <jdmulloy-AT-gmail.com> wrote:

> Sorry forgot to mention the device-mapper thing. I use LVM and md-raid on my
> home server so the write barriers don't do me any good.

Hmm, makes me reconsider again how much of my space to devote to ZFS.
Since I am going to be (among other things) running a virtual build/test
farm, it seems to me that I want the fastest possible storage for
intermediate files, etc., integrity be damned (repositories are hosted
elsewhere). I probably want to reserve the use of ZFS for those things
that I can't afford to lose.

> Another advantage for ZFS-Fuse, you get RAID, Volume Management and
> atomic writes. Is anyone here using it for important data? I suppose I
> could try it and have a nightly tar backup to a reiserfs filesystem
> just in case. Do the NFS and CIFS sharing through the ZFS commands
> work or am I better off just configuring in the standard config files?

IIUC you need to go through the regular Linux interfaces for NFS.
Worked for me when I did it. You also need to have a version of Fuse
that's configured to allow FUSE filesystems to be shared. I don't know
the status of CIFS but I'd guess it's the same.

David Abrahams

unread,
Jan 10, 2009, 8:18:50 PM1/10/09
to zfs-...@googlegroups.com

on Sat Jan 10 2009, "Joseph Mulloy" <jdmulloy-AT-gmail.com> wrote:

From what I understand, Xen has architectural advantages that mean KVM
can't really approach it. But that said, KVM is a lot easier to work with.

I wanted to run Xen on the server, but ran into the same thing because I
need to virtualize Windows there and you can't do that without
virtualization instructions. So I'm sticking with VMWare Server 2. You
can get a free NexentaStor VM that will manage up to 6TB of ZFS for you
if you want to play with ZFS that way.

> I'm starting to lean towards having a two server setup.

Yeah, these days it's really hard not to want the virtualization
instructions. One day I'll probably spin off the VMs to a different
machine and dedicate this server to fileserving, so I can run Solaris or
FreeBSD there.

Fajar A. Nugraha

unread,
Jan 10, 2009, 11:40:50 PM1/10/09
to zfs-...@googlegroups.com
On Sun, Jan 11, 2009 at 8:04 AM, David Abrahams <da...@boostpro.com> wrote:
>
>
> on Sat Jan 10 2009, Chris Samuel <chris-AT-csamuel.org> wrote:
>
>> On Sat, 10 Jan 2009 7:44:38 am Joseph Mulloy wrote:
>>
>>> Newer Linux filesystems use barriers to make write caching "safer" for
>>> metadata.
>>
>> *Only* if those underlying devices support it *and* as long as you're not
>> using the device-mapper layer.
>
> Nooooooooooooooo! My beloved LVM makes things less safe?
>
>> http://hightechsorcery.com/2008/06/linux-write-barriers-write-caching-lvm-and-filesystems
>
> /me gnashes teeth
>

The way I see it, if you need to use LVM (or anything based on device
mapper) and want to keep it safe, you should disable the disk's write
cache.

On zfs-fuse however, things are somewhat complicated because :
- when using LVM, disk write cache needs to be disabled (thus reducing
performance)
- bypassing LVM/MD (i.e. using whole disk or partition) means you can
turn on write cache, but that means you need zfs raidz (unless your
hardware already performs raid).
- zfs-fuse raidz is much slower (about half the performance) than zfs +
linux raid5

Those three factors combined, the feasible choices are :
- zfs on top of LVM/MD with disk write cache disabled, if you trust MD raid
- zfs raidz on real disks, if you can live with much lower performance
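For reference, the per-disk write cache mentioned above can be toggled with hdparm on most ATA/SATA drives (a sketch; /dev/sda is a placeholder, and some USB bridges ignore these commands):

```shell
# Turn the drive's volatile write cache off (safer under LVM/MD, slower):
hdparm -W 0 /dev/sda

# Query the current setting:
hdparm -W /dev/sda

# Turn it back on when ZFS owns the whole disk and can flush it itself:
hdparm -W 1 /dev/sda
```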

Regards,

Fajar

Chris Samuel

unread,
Jan 11, 2009, 2:46:45 AM1/11/09
to zfs-...@googlegroups.com
On Sat, 10 Jan 2009 9:41:43 pm sghe...@hotmail.com wrote:

> I use XFS for day-to-day use, local backups on ZFS (because of the data
> checksumming).

That's pretty much what I'm doing now, except I've swapped from XFS to ext4.

signature.asc

David Abrahams

unread,
Jan 12, 2009, 6:11:43 PM1/12/09
to zfs-...@googlegroups.com

on Sat Jan 10 2009, "Fajar A. Nugraha" <fajar-AT-fajar.net> wrote:

> On Sun, Jan 11, 2009 at 8:04 AM, David Abrahams <da...@boostpro.com> wrote:
>>
>>
>> on Sat Jan 10 2009, Chris Samuel <chris-AT-csamuel.org> wrote:
>>
>>> On Sat, 10 Jan 2009 7:44:38 am Joseph Mulloy wrote:
>>>
>>>> Newer Linux filesystems use barriers to make write caching "safer" for
>>>> metadata.
>>>
>>> *Only* if those underlying devices support it *and* as long as you're not
>>> using the device-mapper layer.
>>
>> Nooooooooooooooo! My beloved LVM makes things less safe?
>>
>>> http://hightechsorcery.com/2008/06/linux-write-barriers-write-caching-lvm-and-filesystems
>>
>> /me gnashes teeth
>>
>
> The way I see it, if you need to use LVM (or anything based on device
> mapper) and want to keep it safe, you should disable the disk's write
> cache.

OK, but I'm going to let it be unsafe. I need some fast scratch area;
I'll back up to ZFS for the stuff that counts.

> On zfs-fuse however, things are somewhat complicated because :
> - when using LVM, disk write cache needs to be disabled (thus reducing
> performance)
> - bypassing LVM/MD (i.e. using whole disk or partition) means you can
> turn on write cache, but that means you need zfs raidz (unless your
> hardware already performs raid).

I think copies=2 or copies=3 would work too.

> - zfs-fuse raidz is much slower (about half the performance) of zfs +
> linux raid5

So what does "zfs+linux raid5" really mean?

> Those three factors combined, the feasible choices are :
> - zfs on top of LVM/MD with disk write cache disabled, if you trust MD raid
> - zfs raidz on real disks, if you can live with much lower performance

How about:

* all disks sliced into a couple of partitions

* dmraid (or mdraid? I can't keep those two straight) striping across
all 1st partitions for unreliable but really fast storage

* zpool across all 2nd partitions.

* ZFS filesystems with copies=2 and copies=3

?
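As a rough sketch of that layout (all device names are placeholders; this is one way to express the scheme, not a tested recipe):

```shell
# Fast but unreliable scratch space: md RAID-0 over the first partitions
mdadm --create /dev/md0 --level=0 --raid-devices=3 \
    /dev/sda1 /dev/sdb1 /dev/sdc1
mkfs.xfs /dev/md0

# Reliable side: a pool striped across the second partitions...
zpool create tank /dev/sda2 /dev/sdb2 /dev/sdc2

# ...with ditto blocks for the filesystems that matter most
zfs create tank/important
zfs set copies=2 tank/important
```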

What I'm most unsure about with this scheme is how to lay out partitions
and/or LVM volumes to maintain the most flexibility to grow either the
fast part or the reliable part as I use the system.

Fajar A. Nugraha

unread,
Jan 12, 2009, 9:41:30 PM1/12/09
to zfs-...@googlegroups.com
David Abrahams wrote:
> on Sat Jan 10 2009, "Fajar A. Nugraha" <fajar-AT-fajar.net> wrote:
>
>> On zfs-fuse however, things are somewhat complicated because :
>> - when using LVM, disk write cache needs to be disabled (thus reducing
>> performance)
>> - bypassing LVM/MD (i.e. using whole disk or partition) means you can
>> turn on write cache, but that means you need zfs raidz (unless your
>> hardware already performs raid).
>>
>
> I think copies=2 or copies=3 would work too.
>
>

copies=n protects against "bad sector" errors (or something like that). It
does not protect against a broken disk: e.g. if you use copies=2 without zfs
mirror/raidz/raidz2 on JBOD, and one of the disks is broken, you can say
"bye bye data ..."

>> - zfs-fuse raidz is much slower (about half the performance) of zfs +
>> linux raid5
>>
>
> So what does "zfs+linux raid5" really mean?
>
>

zfs on top of md raid5, or zfs on top of lvm on top of md.
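Spelled out, that stacking looks something like this (device names are examples): md provides the RAID-5 redundancy, and ZFS sits on top of the single resulting block device:

```shell
# Linux software RAID-5 across three whole disks
mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd

# ZFS sees only one vdev; parity is md's job, checksumming is ZFS's
zpool create tank /dev/md0
```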

>> Those three factors combined, the feasible choices are :
>> - zfs on top of LVM/MD with disk write cache disabled, if you trust MD raid
>> - zfs raidz on real disks, if you can live with much lower performance
>>
>
> How about:
>
> * all disks sliced into a couple of partitions
>
> * dmraid (or mdraid? I can't keep those two straight) striping across
> all 1st partitions for unreliable but really fast storage
>
> * zpool across all 2nd partitions.
>
> * ZFS filesystems with copies=2 and copies=3
>
> ?
>
> What I'm most unsure about with this scheme is how to lay out partitions
> and/or LVM volumes to maintain the most flexibility to grow either the
> fast part or the reliable part as I use the system.
>
>

If you can live with (currently) slow zfs-fuse raidz and have lots of
disks, it's easier to simply have dedicated disks for zfs, create a
partition on each disk that occupies all the available space, and create
a zpool with raidz/raidz2 on those partitions.
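In zpool terms that dedicated-disk layout is a one-liner (partition names are placeholders):

```shell
# Single raidz vdev across one full-size partition per disk
zpool create tank raidz /dev/sdb1 /dev/sdc1 /dev/sdd1

# or, with four disks and double parity:
# zpool create tank raidz2 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
```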

Regards,

Fajar

Jonathan Schmidt

unread,
Jan 13, 2009, 11:00:15 AM1/13/09
to zfs-...@googlegroups.com
>> I think copies=2 or copies=3 would work too.
>
> copies=n protects against "bad sector" (or something like that). It does
> not protect against broken disk. e.g if you use copies=2 without zfs
> mirror/raidz/raidz2 on JBOD, and one of the disks is broken, you can say
> "bye bye data ..."

Not necessarily. First of all I don't think ZFS even has a "JBOD" mode.
If you add your disks as top level vdevs the default behaviour is
striping (similar to RAID0).

Secondly the copies=2/3 chunk allocator uses a best effort strategy to
locate the data and ditto blocks on different devices. There are
limitations to using copies=2/3 for pool redundancy but in general it
does work if you remove or crash a drive. Hopefully I'll get some free
time at some point and I can look into enhancing this (already great)
feature.

David Abrahams

unread,
Jan 13, 2009, 12:58:58 PM1/13/09
to zfs-...@googlegroups.com

on Mon Jan 12 2009, "Fajar A. Nugraha" <fajar-AT-fajar.net> wrote:

> copies=n protects against "bad sector" (or something like that). It does
> not protect against broken disk. e.g if you use copies=2 without zfs
> mirror/raidz/raidz2 on JBOD, and one of the disks is broken, you can say
> "bye bye data ..."

I don't think so; see Jonathan Schmidt's response.

>>> - zfs-fuse raidz is much slower (about half the performance) of zfs +
>>> linux raid5
>>>
>>
>> So what does "zfs+linux raid5" really mean?
>>
>>
>
> zfs on top of md raid5, or zfs on top of lvm on top of md.

OK, but that still has a raid-5 write hole, right?

>>> Those three factors combined, the feasible choices are :
>>> - zfs on top of LVM/MD with disk write cache disabled, if you trust MD raid
>>> - zfs raidz on real disks, if you can live with much lower performance
>>>
>>
>> How about:
>>
>> * all disks sliced into a couple of partitions
>>
>> * dmraid (or mdraid? I can't keep those two straight) striping across
>> all 1st partitions for unreliable but really fast storage
>>
>> * zpool across all 2nd partitions.
>>
>> * ZFS filesystems with copies=2 and copies=3
>>
>> ?
>>
>> What I'm most unsure about with this scheme is how to lay out partitions
>> and/or LVM volumes to maintain the most flexibility to grow either the
>> fast part or the reliable part as I use the system.
>>
>
> If you can live with (currently) slow zfs-fuse raidz and have lots of
> disks,

I have 8. Don't know if that's "lots."

> it's easier to simply have dedicated disks for zfs, create a partition
> on each disks that occupy all available space, and create zpool with
> raidz/raidz2 on those partitions.

Yes, it's easier. However, I'm not sure how my storage needs will
evolve, so it's not necessarily more workable.

Fajar A. Nugraha

unread,
Jan 16, 2009, 4:49:50 PM1/16/09
to zfs-...@googlegroups.com
On Tue, Jan 13, 2009 at 11:00 PM, Jonathan Schmidt <j...@jschmidt.ca> wrote:
>
>>> I think copies=2 or copies=3 would work too.
>>
>> copies=n protects against "bad sector" (or something like that). It does
>> not protect against broken disk. e.g if you use copies=2 without zfs
>> mirror/raidz/raidz2 on JBOD, and one of the disks is broken, you can say
>> "bye bye data ..."
>
> the copies=2/3 chunk allocator uses a best effort strategy to
> locate the data and ditto blocks on different devices. There are
> limitations to using copies=2/3 for pool redundancy but in general it
> does work if you remove or crash a drive.


Have you tried this? Is there a test case for this?
My tests so far indicate that if :
- I have a zpool on multiple vdevs (using files, for test purposes),
striped (not mirror, not raidz), created with

# dd if=/dev/zero of=/tmp/disk1 bs=1M seek=100 count=0
0+0 records in
0+0 records out
0 bytes (0 B) copied, 4.7763e-05 s, 0.0 kB/s
# dd if=/dev/zero of=/tmp/disk2 bs=1M seek=100 count=0
0+0 records in
0+0 records out
0 bytes (0 B) copied, 4.8191e-05 s, 0.0 kB/s
# zpool create test /tmp/disk1 /tmp/disk2

- set copies=2 (or more)
- put some test files on it
- export zpool
- remove one of the vdevs (in this case /tmp/disk2)
- try to import the pool

It will fail

# zpool import -d /tmp
pool: test
id: 17946218857558819078
state: UNAVAIL
status: One or more devices are missing from the system.
action: The pool cannot be imported. Attach the missing
devices and try again.
see: http://www.sun.com/msg/ZFS-8000-6X
config:

test UNAVAIL missing device
/tmp/disk1 ONLINE

Additional devices are known to be part of this pool, though their
exact configuration cannot be determined.


So I'm not sure what you mean by "but in general it does work if you
remove or crash a drive", unless I'm doing something wrong with the
test.

Jonathan Schmidt

unread,
Jan 16, 2009, 5:07:09 PM1/16/09
to zfs-...@googlegroups.com

Yes, I have tested that configuration and I did not encounter the error
you did ("Additional devices are known to be part of this pool, though
their exact configuration cannot be determined"). I got something more
like this:

test DEGRADED (I think?)
/tmp/disk1 ONLINE
/tmp/disk2 MISSING

It wasn't exactly that (but something close) and unfortunately I can't
test it out at the moment. However, the pool was importable, mountable,
and the files were intact. I'm curious though, are your temp "disk" files
0 bytes long? Mine were 100MB.

Fajar A. Nugraha

unread,
Jan 16, 2009, 5:22:23 PM1/16/09
to zfs-...@googlegroups.com
On Sat, Jan 17, 2009 at 5:07 AM, Jonathan Schmidt <j...@jschmidt.ca> wrote:
>>> the copies=2/3 chunk allocator uses a best effort strategy to
>>> locate the data and ditto blocks on different devices. There are
>>> limitations to using copies=2/3 for pool redundancy but in general it
>>> does work if you remove or crash a drive.
>>
>>
>> Have you tried this? Is there a test case for this?

> Yes, I have tested that configuration and I did not encounter the error


> you did ("Additional devices are known to be part of this pool, though
> their exact configuration cannot be determined"). I got something more
> like this:
>
> test DEGRADED (I think?)
> /tmp/disk1 ONLINE
> /tmp/disk2 MISSING
>

Are you sure it was a stripe setup? It seems like a message for a
mirror/raidz setup

> It wasn't exactly that (but something close) and unfortunately I can't
> test it out at the moment. However, the pool was importable, mountable,
> and the files were intact. I'm curious though, are your temp "disk" files
> 0 bytes long? Mine were 100MB.
>

It was 0 before pool creation because it was created as a sparse file
(with seek=100 and count=0). Now it occupies about 1.3 MB.
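For reference, sparse test "disks" like the ones described above can be created with dd; the file path is just an example, and the zpool step is shown only as a comment because it needs root and a running zfs-fuse daemon:

```shell
# Create a 100 MB sparse "disk" file for pool testing. seek=100 with
# count=0 extends the file to 100 MiB without writing any data, so it
# occupies almost no real space until the pool writes to it.
dd if=/dev/zero of=/tmp/disk1 bs=1M count=0 seek=100

# Apparent size is 100 MiB...
ls -l /tmp/disk1
# ...but actual usage is near zero blocks until data lands in it:
du -k /tmp/disk1

# A test pool could then be built from such files, e.g.:
#   zpool create test /tmp/disk1 /tmp/disk2
```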

Jonathan Schmidt

unread,
Jan 16, 2009, 5:32:26 PM1/16/09
to zfs-...@googlegroups.com
>>>> the copies=2/3 chunk allocator uses a best effort strategy to
>>>> locate the data and ditto blocks on different devices. There are
>>>> limitations to using copies=2/3 for pool redundancy but in general it
>>>> does work if you remove or crash a drive.
>>>
>>>
>>> Have you tried this? Is there a test case for this?
>
>> Yes, I have tested that configuration and I did not encounter the error
>> you did ("Additional devices are known to be part of this pool, though
>> their exact configuration cannot be determined"). I got something more
>> like this:
>>
>> test DEGRADED (I think?)
>> /tmp/disk1 ONLINE
>> /tmp/disk2 MISSING
>>
>
> Are you sure it was a stripe setup? It seems like a message for a
> mirror/raidz setup

I'm positive it was striped (that was the whole reason for the test).
"DEGRADED" might be the wrong status, hence the uncertainty on that line.
disk2 did list as missing, though. Your pool seems to have forgotten
about it entirely.

I ran this test a long time ago (back when 0.4.0 was new) so maybe the
behaviour has changed since then? I'll test it again when I get the
chance.

>> It wasn't exactly that (but something close) and unfortunately I can't
>> test it out at the moment. However, the pool was importable, mountable,
>> and the files were intact. I'm curious though, are your temp "disk"
>> files
>> 0 bytes long? Mine were 100MB.
>>
>
> It was 0 before pool creation because it was created as a sparse file
> (with the seek=100 and count=0). Now it occupies about 1.3 MB.

Right.

David Abrahams

unread,
Jan 16, 2009, 6:35:00 PM1/16/09
to zfs-...@googlegroups.com

on Fri Jan 16 2009, "Jonathan Schmidt" <jon-AT-jschmidt.ca> wrote:

>>> Yes, I have tested that configuration and I did not encounter the error
>>> you did ("Additional devices are known to be part of this pool, though
>>> their exact configuration cannot be determined"). I got something more
>>> like this:
>>>
>>> test DEGRADED (I think?)
>>> /tmp/disk1 ONLINE
>>> /tmp/disk2 MISSING
>>>
>>
>> Are you sure it was a stripe setup? It seems like a message for a
>> mirror/raidz setup
>
> I'm positive it was striped (that was the whole reason for the test).
> "DEGRADED" might be the wrong status, hence the uncertainty on that line.
> The disk2 did list as missing though. Your pool seems to have forgotten
> about it entirely.
>
> I ran this test a long time ago (back when 0.4.0 was new) so maybe the
> behaviour has changed since then? I'll test it again when I get the
> chance.

Once again, patiently waiting for new results before deciding how to
proceed :-)

Thanks all,

Fajar A. Nugraha

unread,
Jan 17, 2009, 12:37:19 AM1/17/09
to zfs-...@googlegroups.com
On Sat, Jan 17, 2009 at 6:35 AM, David Abrahams <da...@boostpro.com> wrote:
> Once again, patiently waiting for new results before deciding how to
> proceed :-)

Here's what I'd do :
(1) Use hardware raid or Linux MD to provide redundancy
Yes, it's possible to hit the raid-5 write hole. Having a good storage
controller with battery-backed cache to handle raid helps prevent
this.
(2) Depending on your setup, you may need to turn OFF write caching on
each disk to ensure data safety.
(3) Set up LVM on top of MD
(4) Allocate LVs for zfs-fuse or ext3 as necessary, leaving room to
grow. For example, from a 1TB VG, allocate 100GB for zfs-fuse and 100GB
for ext3. Leave 800GB untouched for now.
(5) Create the zpool using only one LV as a vdev. After that, either :
- turn off checksums, OR
- leave zfs checksums on and set copies=2 for important data
(6) If you need more space for zfs, create another LV, and add that LV
as another vdev
(7) If you need more space for ext3, grow the LV and use resize2fs

This setup should give you the most flexibility, and a balance between
performance and data safety.

Regards,

Fajar
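The layering proposed above can be sketched as a dry run. Device names, volume-group names, and sizes (/dev/sd[b-e], vg0, 100G) are hypothetical, and every command needs root, so each step is only echoed here rather than executed:

```shell
# Dry-run sketch of the MD -> LVM -> zfs-fuse/ext3 layering.
run() { echo "+ $*"; }   # print instead of execute

# (1) Redundancy from Linux MD (raid5 over four example disks)
run mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde

# (3) LVM on top of the MD device
run pvcreate /dev/md0
run vgcreate vg0 /dev/md0

# (4) Carve out LVs, leaving the rest of the VG free to grow later
run lvcreate -L 100G -n zfslv vg0
run lvcreate -L 100G -n ext3lv vg0

# (5) One LV as the pool's single vdev; copies=2 for important data
run zpool create datapool /dev/vg0/zfslv
run zfs set copies=2 datapool
run mkfs.ext3 /dev/vg0/ext3lv

# (6) More zfs space later: another LV, added as another vdev
run lvcreate -L 100G -n zfslv2 vg0
run zpool add datapool /dev/vg0/zfslv2

# (7) More ext3 space later: grow the LV, then the filesystem
run lvextend -L +100G /dev/vg0/ext3lv
run resize2fs /dev/vg0/ext3lv
```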

David Abrahams

unread,
Jan 18, 2009, 3:47:11 PM1/18/09
to zfs-...@googlegroups.com

on Sat Jan 17 2009, "Fajar A. Nugraha" <fajar-AT-fajar.net> wrote:

> On Sat, Jan 17, 2009 at 6:35 AM, David Abrahams <da...@boostpro.com> wrote:
>> Once again, patiently waiting for new results before deciding how to
>> proceed :-)
>
> Here's what I'd do :
> (1) Use hardware raid or Linux MD to provide redundancy
> Yes, it's possible to have raid-5 write hole. Having a good storage
> controller with battery-backed cache to handle raid helps prevent
> this.

Why accept a raid-5 write hole for my critical data if I don't have to?

> (2) Depending on your setup, you may need to turn OFF write caching on
> each disk to ensure data safety.

Yeah, not necessary if I do what I'm planning, since all critical data
would go on ZFS.

> (3) Setup LVM on top of MD
> (4) Allocate PVs for zfs-fuse or ext3 as necessary, leaving room to
> grow. For example, from 1TB VG, allocate 100GB for zfs-fuse and 100GB
> for ext3. Leave 800GB untouched for now.
> (5) Create zpool using only one LV as vdef. After that, either :
> - turn of checksum, OR
> - leave zfs checksum on and have copies=2 for important data
> (6) If you need more space for zfs, create another LV, and add that LV
> as another vdef
> (7) If you need more space for ext3, grow the LV and use resize2fs
>
> This setup should give you most flexibility, and a balance between
> performance and data safety.


Hum. Downsides:

1. I can't read the disks on anything but a Linux system. If I want to
run FreeBSD or Solaris to get native ZFS, I'm out of luck.

2. RAID-5 write hole

I think I should slice each disk into two physical partitions and make a
ZFS pool out of a partition from each disk, so I can mount the pool from
some other OS. I'd use the other partitions with LVM. It would be
ideal if it were possible to grow these zfs partitions later, when I am
ready to toss out the LVM stuff, but I'm not sure if that's really
possible. Anybody know?

Fajar A. Nugraha

unread,
Jan 18, 2009, 9:20:24 PM1/18/09
to zfs-...@googlegroups.com
David Abrahams wrote:
> on Sat Jan 17 2009, "Fajar A. Nugraha" <fajar-AT-fajar.net> wrote:
>
>> Here's what I'd do :
>> (1) Use hardware raid or Linux MD to provide redundancy
>> Yes, it's possible to have raid-5 write hole. Having a good storage
>> controller with battery-backed cache to handle raid helps prevent
>> this.
>>
>
> Why accept a raid-5 write hole for my critical data if I don't have to?
>
>
Let's see :
- Are you using a storage controller that can't present the real disk to
the OS (some HP Proliant models come to mind)?
- Do you need more I/O throughput than what zfs-fuse raidz can give you?
- Do you often need to change the disk allocation between ext3 and zfs?

If yes, then you might need to accept the POSSIBILITY of the raid-5 write
hole, or switch to a different server/OS combination. Note that
battery-backed cache on some storage controllers can help reduce the
possibility of the write hole.

If not, your original setup sounds good enough.

>> This setup should give you most flexibility, and a balance between
>> performance and data safety.
>>
>
>
> Hum. Downsides:
>
> 1. I can't read the disks on anything but a Linux system. If I want to
> run FreeBSD or Solaris to get native ZFS, I'm out of luck.
>
> 2. RAID-5 write hole
>
>

Correct. At this point I'd say you need to prioritize. What's on top of
your list? Is it performance? scalability? Data integrity?


> I think I should slice each disk into two physical partitions and make a
> ZFS pool out of a partition from each disk, so I can mount the pool from
> some other OS. I'd use the other partitions with LVM. It would be
> ideal if it were possible to grow these zfs partitions later, when I am
> ready to toss out the LVM stuff, but I'm not sure if that's really
> possible. Anybody know?
>
>

It's possible, as in you can use the space previously used by LVM as
vdevs (in stripe/raidz/raidz2, your choice) and use the vdev as :
- a new pool, or
- as an addition to grow your old pool

What you can't do is change the raidz configuration from (say) 8-vdev
raidz to 16-vdev raidz

Regards,

Fajar

Jonathan Schmidt

unread,
Jan 18, 2009, 9:26:37 PM1/18/09
to zfs-...@googlegroups.com
> It's possible, as in you can use the space previously used by LVM as
> vdevs (in stripe/raidz/raidz2, your choice) and use the vdev as :
> - a new pool, or
> - as an addition to grow your old pool
>
> What you can't do is change the raidz configuration from (say) 8-vdev
> raidz to 16-vdev raidz

No, but you can add a second 8-vdev raidz, keeping the same level of
redundancy. Switching to a 16-vdev raidz would lower the fault
tolerance of the array.

Fajar A. Nugraha

unread,
Jan 18, 2009, 9:50:29 PM1/18/09
to zfs-...@googlegroups.com
Jonathan Schmidt wrote:
>> It's possible, as in you can use the space previously used by LVM as
>> vdevs (in stripe/raidz/raidz2, your choice) and use the vdev as :
>> - a new pool, or
>> - as an addition to grow your old pool
>>
>> What you can't do is change the raidz configuration from (say) 8-vdev
>> raidz to 16-vdev raidz
>>
>
> No, but you can add a second 8-vdev raidz, keeping the same level of
> redundancy.
That's what I meant by "as an addition to grow your old pool" :D
What I was highlighting is that you can't modify an existing raidz vdev
by adding more drives or swapping in larger ones to add size, but you
can add another disk/mirror/raidz vdev to that pool (preferably of the
same type, to keep the same level of redundancy). Thanks for
making it clearer.

Regards,

Fajar

David Abrahams

unread,
Jan 18, 2009, 11:56:11 PM1/18/09
to zfs-...@googlegroups.com

on Sun Jan 18 2009, "Fajar A. Nugraha" <fajar-AT-fajar.net> wrote:

> David Abrahams wrote:
>> on Sat Jan 17 2009, "Fajar A. Nugraha" <fajar-AT-fajar.net> wrote:
>>
>>> Here's what I'd do :
>>> (1) Use hardware raid or Linux MD to provide redundancy
>>> Yes, it's possible to have raid-5 write hole. Having a good storage
>>> controller with battery-backed cache to handle raid helps prevent
>>> this.
>>>
>>
>> Why accept a raid-5 write hole for my critical data if I don't have to?
>>
>>
> Let's see :
> - Are you using a storage controller that can't present the real disk to
> the OS (some HP Proliant models come to mind)?

I don't think so.

> - Do you need more I/O throughput than what zfs-fuse raidz can give
> you?

I'd like that, but I don't need that part of the storage to be
ZFS-reliable.

> - Do you often need to change the disk allocation between ext3 and zfs?

Who uses ext3? ;-)

No, not yet.

> If yes, then you might need to accept the POSSIBILITY of raid-5 write
> hole, or switch a different server/OS combination. Note that
> battery-backed cache on some storage controllers can help reduce the
> possibility of the write hole.
>
> If not, your original setup sounds good enough.
>
>>> This setup should give you most flexibility, and a balance between
>>> performance and data safety.
>>>
>>
>>
>> Hum. Downsides:
>>
>> 1. I can't read the disks on anything but a Linux system. If I want to
>> run FreeBSD or Solaris to get native ZFS, I'm out of luck.
>>
>> 2. RAID-5 write hole
>>
>>
> Correct. At this point I'd say you need to prioritize. What's on top of
> your list? Is it performance? scalability? Data integrity?

I hate to choose ;-)

>> I think I should slice each disk into two physical partitions and make a
>> ZFS pool out of a partition from each disk, so I can mount the pool from
>> some other OS. I'd use the other partitions with LVM. It would be
>> ideal if it were possible to grow these zfs partitions later, when I am
>> ready to toss out the LVM stuff, but I'm not sure if that's really
>> possible. Anybody know?
>>
> It's possible, as in you can use the space previously used by LVM as
> vdevs (in stripe/raidz/raidz2, your choice) and use the vdev as :
> - a new pool, or
> - as an addition to grow your old pool

I don't think using it to grow an old pool with copies=xxx would be a
good plan, because then copies are more likely to end up on the same
disk. I think that leaves adding a new raid-z pool. Still, I would
prefer to end up with whole-disk vdevs if I throw out the LVM parts.

> What you can't do is change the raidz configuration from (say) 8-vdev
> raidz to 16-vdev raidz

Yeah, that's one reason to go with copies=xxx redundancy instead.

Fajar A. Nugraha

unread,
Jan 19, 2009, 1:07:13 AM1/19/09
to zfs-...@googlegroups.com
David Abrahams wrote:
> on Sun Jan 18 2009, "Fajar A. Nugraha" <fajar-AT-fajar.net> wrote:
>
>
>> - Do you often need to change the disk allocation between ext3 and zfs?
>>
>
> Who uses ext3? ;-)
>
Okay, whatever non-zfs filesystem that you use then :)

>> At this point I'd say you need to prioritize. What's on top of
>> your list? Is it performance? scalability? Data integrity?
>>
>
> I hate to choose ;-)
>
>

Welcome to the real world :D
As a side note, an opensolaris advocate would probably say that :
- switching to Opensolaris, or
- using an external opensolaris-based storage server and exporting the
disks using iscsi or nfs
would get you all three.


>>> I think I should slice each disk into two physical partitions and make a
>>> ZFS pool out of a partition from each disk, so I can mount the pool from
>>> some other OS. I'd use the other partitions with LVM. It would be
>>> ideal if it were possible to grow these zfs partitions later, when I am
>>> ready to toss out the LVM stuff, but I'm not sure if that's really
>>> possible. Anybody know?
>>>
>>>
>> It's possible, as in you can use the space previously used by LVM as
>> vdevs (in stripe/raidz/raidz2, your choice) and use the vdev as :
>> - a new pool, or
>> - as an addition to grow your old pool
>>
>
> I don't think using it to grow an old pool with copies=xxx would be a
> good plan, because then copies are more likely to end up on the same
> disk.

Who said anything about copies=n?
Growing your old pool would mean that if you initially do

zpool create datapool raidz sdb2 sdc2 sdd2 sde2 sdf2

you can grow it later using

zpool add datapool raidz sdb1 sdc1 sdd1 sde1 sdf1

This is assuming sd[b-f]1 was previously used for linux partitions.
From zfs's point of view, the first and second raidz's redundancy would
be managed independently. So if (say) sdb is broken, two resilver
processes take place : one for the sdx2 series, and one for the sdx1
series.

> I think that leaves adding a new raid-z pool. Still, I would
> prefer to end up with whole-disk vdevs if I throw out the LVM parts.
>
>

Yeah, that would be good. The problem is AFAICT growing/shrinking a
raidz vdev is not possible ATM. You'd need to export-import the data. Again,
this depends on your priorities. If it were me, I'd live with
partition-vdevs instead of having to export-import several hundred GBs of
data.

Regards,

Fajar

David Abrahams

unread,
Jan 19, 2009, 9:24:58 PM1/19/09
to zfs-...@googlegroups.com

on Mon Jan 19 2009, "Fajar A. Nugraha" <fajar-AT-fajar.net> wrote:

> David Abrahams wrote:
>> on Sun Jan 18 2009, "Fajar A. Nugraha" <fajar-AT-fajar.net> wrote:
>>
>>
>>> - Do you often need to change the disk allocation between ext3 and zfs?
>>>
>>
>> Who uses ext3? ;-)
>>
> Okay, whatever non-zfs filesystem that you use then :)
>
>>> At this point I'd say you need to prioritize. What's on top of
>>> your list? Is it performance? scalability? Data integrity?
>>>
>>
>> I hate to choose ;-)
>>
>>
> Welcome to the real world :D
> As a side note, an opensolaris advocate would probably say that :
> - switching to Opensolaris, or
> - using an external opensolaris-based storage server and export the
> disks using iscsi or nfs
> would get you all of those three.

I know. Unfortunately there are a few other restrictions, like the fact
that I want to be able to virtualize FreeBSD on this machine, and it
isn't a well-supported guest of VirtualBox. Also I have some
existing VMware VMs that need to work.

>>>> I think I should slice each disk into two physical partitions and make a
>>>> ZFS pool out of a partition from each disk, so I can mount the pool from
>>>> some other OS. I'd use the other partitions with LVM. It would be
>>>> ideal if it were possible to grow these zfs partitions later, when I am
>>>> ready to toss out the LVM stuff, but I'm not sure if that's really
>>>> possible. Anybody know?
>>>>
>>>>
>>> It's possible, as in you can use the space previously used by LVM as
>>> vdevs (in stripe/raidz/raidz2, your choice) and use the vdev as :
>>> - a new pool, or
>>> - as an addition to grow your old pool

You can add a new raidz2 to an existing pool? Awesome.

>> I don't think using it to grow an old pool with copies=xxx would be a
>> good plan, because then copies are more likely to end up on the same
>> disk.
>
> Who said anything about copies=n?

I did, remember?

> Growing your old pool would mean that if you initially do
>
> zpool create datapool raidz sdb2 sdc2 sdd2 sde2 sdf2
>
> you can grow it later using
>
> zpool add datapool raidz sdb1 sdc1 sdd1 sde1 sdf1

Yeah, I like that.

> This is assuming sd[b-f]1 is previously used for linux partitions.
> From zfs's point of view, the first and second raidz's redundancy would
> be managed independently. So if (say) sdb is broken, two resilver
> processes takes place : one for the sdx2 series, and one for the sdx1
> series.

That's tolerable I think.

>> I think that leaves adding a new raid-z pool. Still, I would
>> prefer to end up with whole-disk vdevs if I throw out the LVM parts.
>>
>>
> Yeah, that would be good. The problem is AFAICT growing/shrinking a
> raidz vdev is not passible ATM. You'd need to export-import data. Again,
> this depends on your priorities. If it were me, I'd live with
> partition-vdev instead oh having to export-import several hundred GBs of
> data.

Yeah, me too. So now I'm thinking:

+-----+-----+-----+-----+-----+-----+-----+-----+
| sda | sdb | sdc | sdd | sde | sdf | sdg | sdh |
+=====+=====+=====+=====+=====+=====+=====+=====+
|XXXXX|XXXXX|XXXXX|XXXXX|XXXXX|XXXXX|XXXXX|XXXXX|
+-----+-----+-----+-----+-----+-----+-----+-----+
|YYYYY|YYYYY|YYYYY|YYYYY|YYYYY|YYYYY|YYYYY|YYYYY|
+-----+-----+-----+-----+-----+-----+-----+-----+
|ZZZZZ|ZZZZZ|ZZZZZ|ZZZZZ|ZZZZZ|ZZZZZ|ZZZZZ|ZZZZZ|
+-----+-----+-----+-----+-----+-----+-----+-----+
| | | | | | | | |
. .
. Uncommitted space .
. .
| | | | | | | | |
+-----+-----+-----+-----+-----+-----+-----+-----+


Where:

XXXXs are a dmraid RAID5 containing Linux boot and root. I'm "counting
on" my UPS to mitigate the chance of disaster here.

YYYYs are a dmraid RAID0 for a volatile scratch area (e.g. object
files created when tests are built)

ZZZZs are RAIDZ2 for critical data.

Write cache enabled.

I can add more of X, Y, or Z in the uncommitted space as my needs
evolve. Probably by the time I need another slice of Zs, BTRFS will be
ready for action anyway ;-)
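The layout sketched above could translate to something like the following dry run. Partition boundaries and device names are illustrative assumptions, mdadm is shown in place of dmraid as the more common tool for software RAID, and each command is echoed rather than executed since the real thing needs root and eight physical disks:

```shell
# Dry-run sketch of the X/Y/Z slice layout across sda..sdh.
run() { echo "+ $*"; }   # print instead of execute

# Three partitions per disk (X, Y, Z), leaving the rest uncommitted
for d in a b c d e f g h; do
  run parted /dev/sd$d mkpart primary 1MiB  20GiB    # X: boot/root
  run parted /dev/sd$d mkpart primary 20GiB 40GiB    # Y: scratch
  run parted /dev/sd$d mkpart primary 40GiB 400GiB   # Z: critical
done

# X slices: RAID5 for Linux boot and root
run mdadm --create /dev/md0 --level=5 --raid-devices=8 /dev/sd[a-h]1
# Y slices: RAID0 for the volatile scratch area
run mdadm --create /dev/md1 --level=0 --raid-devices=8 /dev/sd[a-h]2
# Z slices: raidz2 pool for critical data
run zpool create critical raidz2 /dev/sd[a-h]3
```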

Reply all
Reply to author
Forward
0 new messages